DOCUMENT PROCESSOR, DOCUMENT CLASSIFICATION DEVICE,
DOCUMENT PROCESSING METHOD, DOCUMENT CLASSIFICATION METHOD,
AND COMPUTER-READABLE RECORDING MEDIUM
FOR RECORDING PROGRAMS FOR
EXECUTING THE METHODS ON A COMPUTER

FIELD OF THE INVENTION

The present invention relates to a document processor for
displaying and printing multiple input document data in a
predetermined format, a document processing method, and a
computer-readable recording medium for recording a program to
execute the method on a computer. Furthermore, this invention
relates to a document classification device and a document
classification method for classifying multiple input document
data based on the contents thereof, and particularly for
refining classification categories calculated during document
classification, and to a computer-readable recording medium for
recording a program to execute the method on a computer.

BACKGROUND OF THE INVENTION

Various document classification devices and document
retrieval devices have been developed in recent years. The
proliferation of network technology, such as the Internet, has
made it possible to access a huge number of electronic documents,
domestically and overseas, and there has been a proportionate
rapid expansion in the amount of data which is stored
electronically. Accordingly, there is an increasing need for
intellectual operations such as classifying large collections
of document data into meaningful categories.

The benefits of classifying large amounts of document 
data according to their meaning are as follows. Firstly, it 
makes it easier to retrieve data. Retrieval becomes relatively 
easy since vast groups of documents can be retrieved using 
category names as clues. 

Secondly, entire groups of data can be grasped. That is, 
it is possible to grasp the contents (individual 
classifications) of an entire cluster of documents. However, 
when a large amount of document data is classified by an operator, 
although accurate classification can be achieved, 
classification requires enormous manpower and time. 
Consequently, in view of the huge amount of documents stored 
in recent years, devices for automatically classifying document 
data have been proposed. 

As an example of a conventional device for automatically 
classifying documents, Japanese Patent Application Laid-open 
(JP-A) No. 7-36897 discloses a device which defines a document 
as a document vector characterized by a word, uses clustering 
to group these document vectors, and automatically classifies 
the documents based on the grouped document vectors. 

Furthermore, in "Projections for Efficient Document
Clustering" (Authors: Hinrich Schütze and Craig Silverstein,
Academy: ACM, Title of Paper: Proceedings of SIGIR, pages: 78-81,
Year of Publication: 1997), documents are classified in a latent
semantic space. Other conceivable methods include using a
probability theory approach, etc.

Furthermore, in recent years, the proliferation of the
Internet and the like has made it possible to access large
numbers of document clusters, and as a result, there is an
increasing need to be able to use these document clusters
effectively, and in accordance with the intentions of a variety
of users. To accomplish this, an intellectual operation is
starting to be used in which a large number of document clusters
is classified into meaningful categories, and the structure of
the document clusters is grasped. However, when this type of
classification is performed manually, enormous manpower and
time are required. Further, since only the classifier knows
how to classify the document data, classification standards
change when the person responsible for classification is
replaced.

Consequently, there is a demand for a document
classification device capable of automatically classifying
groups of documents according to the same type of classification
standards used by humans. For example, Japanese Patent
Application Laid-open (JP-A) No. 7-114572 discloses a document
classification device capable of automatically extracting a
word characteristic vector from a document, and classifying the
document based on the characteristic vector, thereby making it
possible to automatically classify the documents using
meaningful differences.

However, since the conventional document classification
device described above uses a method for statistically
classifying documents arranged in multi-dimensional space
essentially comprising words, the result of the classification
is nothing more than the statistically determined behaviour of
the words. Consequently, clusters (partial groups of
individual classified documents) calculated after
classification are sometimes incomprehensible to the operator
(user).

A further problem is that the question of what kind of 
classification is appropriate depends on the characteristics 
of the document clusters to be classified and the intentions
of the user, making it difficult to define an appropriate 
classification. In particular, when grasping entire data 
groups as mentioned above, the type of classification required 
will differ depending on the widely varying intentions of the 
operators, and it will be difficult to obtain the result desired 
by the operator in a single classification. 

Thus, the problem can be interpreted by saying that a 
document classification result includes a great amount of noise, 
only one part of which is of use to the operator. 



Furthermore, the conventional technology does not
consider the constitutional units of the document. In a case
where the structure of a document is partitioned by one or
multiple period symbols, titles, and the like, multiple topics
and meanings are contained in a single document. As a result,
it is difficult for a user to understand the classification
categories, a category may be limited to a specific topic or
specific meaning, or the document may be classified under a
category different from that intended by the user.

A context-dependent automatic classification device is 
disclosed in Japanese Patent Application Laid-open (JP-A) No. 
6-176064, and aims to increase classification precision by 
automatically classifying documents in consideration of the 
conclusive data therein, but essentially does not solve the 
problems mentioned above. 

Furthermore, conventional document processors, such as 
the document classification device and document retrieval device
described above, merely classify or retrieve documents, and 
give no consideration to further analysis of information hidden 
in the document clusters. Consequently, they have a 
disadvantage that a separate analyzing device must be used to 
analyze information hidden in the document clusters. 

Furthermore, the operator who wishes to analyze the
information does not perform classification and retrieval as
an end in itself, but simply as an intermediate step during his
analysis of the information. After classification and
retrieval, in order to grasp the result more easily it is usually
necessary to derive a meaningful result from the information
analysis by repeating a variety of other processes, such as
maximizing the practical usefulness of the information included
in the original document, rearranging the result, carrying out
totalization and statistical processing, and drawing up charts
and graphs based on the results.

Furthermore, table-calculating software is sometimes
needed when analyzing information about numerical data.
However, table-calculating software was originally developed
to handle numerical data, and is not sufficiently effective for
analyzing textual data, particularly when the analysis concerns
the meaning of documents.

SUMMARY OF THE INVENTION

This invention has been achieved in order to solve the
problems of the conventional examples described above. It is
a first object of the present invention to provide a document
processor, a document processing method, and a
computer-readable recording medium storing programs for executing the
method on a computer, for carrying out analysis concerning the
meaning of documents, not simply by outputting the results of
fixed functions such as classification and retrieval, but by
supporting a complete range of information analysis.

To solve the problems of the conventional example 
described above, it is a second object of the present invention 
to provide a document classification device and a document 
classification method capable of momentarily determining what 
type of contents are contained in a given document cluster, and 
a computer-readable recording medium for storing programs for 
executing the method on a computer. 

Furthermore, to solve the problems of the conventional 
example described above, it is a third object of the present 
invention to provide a document classification device and a 
document classification method wherein, when one document 
contains multiple topics and meanings, these can be classified 
into categories according to specific topics and meanings, so 
that the classifications do not differ from categories desired 
by a user, thereby enabling the user to easily comprehend the 
classification categories, and a computer-readable recording 
medium for storing programs for executing the method on a 
computer.

In order to solve the problems mentioned above, the
document processor according to one aspect of the present
invention for displaying and printing in a predetermined format
multiple input document data, comprises a document memory unit
for storing input document data; a selection unit for selecting
all or part of document data stored in the document memory unit;
a characteristics extraction unit for extracting data relating
to characteristics of letter rows from all or part of the
document data selected by the selection unit; a work processing
unit for work-processing all or part of the document data based
on the data relating to characteristics of letter rows extracted
by the characteristics extraction unit; and an output unit for
outputting all or part of the document data work-processed by
the work processing unit.

According to the above aspect of this invention, when
analyzing documents according to their meanings, rather than
merely outputting the result of the analysis, the entire
information analysis operation can be supported.
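
The flow of the units described above (storing, selecting, extracting
characteristics of letter rows, work-processing, and outputting) might be
sketched as follows. This is only an illustration under stated assumptions:
the function names, the record layout, and the use of keyword counts as the
"characteristics of letter rows" are all hypothetical and are not taken from
the specification.

```python
# Hypothetical sketch: every name and the keyword-count feature are assumptions.

def select(store, predicate):
    """Selection unit: pick all or part of the stored document data."""
    return [doc for doc in store if predicate(doc)]

def extract_characteristics(doc, keywords):
    """Characteristics extraction unit: count keyword occurrences in the text."""
    text = doc["text"].lower()
    return {k: text.count(k) for k in keywords}

def work_process(docs, keywords):
    """Work processing unit: attach features and a label derived from them."""
    processed = []
    for doc in docs:
        features = extract_characteristics(doc, keywords)
        if any(features.values()):
            label = max(features, key=features.get)
        else:
            label = "other"
        processed.append({**doc, "features": features, "label": label})
    return processed

def output(processed):
    """Output unit: one tab-separated line per processed document."""
    return [f'{d["id"]}\t{d["label"]}\t{d["text"]}' for d in processed]

# Document memory unit, modelled here as a plain list of records.
store = [
    {"id": 1, "text": "Printer jam on floor 2", "year": 1998},
    {"id": 2, "text": "Network timeout", "year": 1999},
    {"id": 3, "text": "Old memo", "year": 1990},
]
selected = select(store, lambda d: d["year"] >= 1998)
processed = work_process(selected, ["printer", "network"])
lines = output(processed)
```

Keeping each unit as a separate function mirrors the way the output of one
unit can later be fed back to the selection step for further analysis.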

Further, the output unit of the document processor
comprises an item value set unit for setting a plurality of item
values based on the contents of all or part of the document data
work-processed by the work-processing unit; and a totalization
unit for totalizing all or part of the document data for each
item value set by the item value set unit. Furthermore, the
output unit outputs all or part of the document data in the format
of a table having an item value as at least one axis.

Hence the result of the work-processing can easily be
expressed in a cross table, and the contents of the information
can easily be grasped. Therefore, when analyzing documents
according to their meanings, rather than merely outputting the
result of the analysis, the entire information analysis
operation can be supported.
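
A cross table of the kind mentioned above could be produced roughly as in the
following sketch. It assumes that "totalizing" means counting documents per
pair of item values; the item names ("topic", "year") and the helper names are
hypothetical.

```python
from collections import Counter

def totalize(documents, row_item, col_item):
    """Count documents for each (row value, column value) pair."""
    counts = Counter((d[row_item], d[col_item]) for d in documents)
    row_values = sorted({d[row_item] for d in documents})
    col_values = sorted({d[col_item] for d in documents})
    # Counter returns 0 for pairs that never occur.
    table = [[counts[(r, c)] for c in col_values] for r in row_values]
    return row_values, col_values, table

def render(row_values, col_values, table):
    """Render the cross table as tab-separated text."""
    lines = ["\t".join([""] + col_values)]
    for r, cells in zip(row_values, table):
        lines.append("\t".join([r] + [str(n) for n in cells]))
    return "\n".join(lines)

# Work-processed documents with two item values each (illustrative data).
documents = [
    {"topic": "printer", "year": "1998"},
    {"topic": "printer", "year": "1999"},
    {"topic": "network", "year": "1999"},
    {"topic": "network", "year": "1999"},
]
rows, cols, table = totalize(documents, "topic", "year")
print(render(rows, cols, table))
```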

Further, the output unit outputs all or part of the
document data work-processed by the work processing unit
together with all or part of the document data in its state prior
to work-processing by the work processing unit.

Hence data to be work-processed and other data can be
displayed simultaneously and identified, whereby the range of
the work-processing to be carried out can be accurately and
easily determined. Therefore, when analyzing documents
according to their meanings, rather than merely outputting the
result of the analysis, the entire information analysis
operation can be supported.

Further, the document memory unit also stores all or part
of the document data work-processed by the work processing unit.

Since other data can be handled simultaneously, when
thereafter analyzing documents according to their meanings,
rather than merely outputting the result of the analysis, the
entire information analysis operation can be supported.

Further, the selection unit further selects all or part
of the document data output by the output unit.

Since all or part of the document data output by the output
unit can be selected for analysis, a wide variety of information
can be analyzed with high precision. Therefore, when analyzing
documents according to their meanings, rather than merely
outputting the result of the analysis, the entire information
analysis operation can be supported.

Further, the document memory unit further stores data 
relating to contents of the work processing. 

Hence not only can loss of data relating to the contents
of work-processing be prevented and the data managed easily,
but also the relationship between settings used in the
work-processing and the processed result can be determined.
Therefore, when analyzing documents according to their meanings,
rather than merely outputting the result of the analysis, the
entire information analysis operation can be supported.

A document classification device for classifying
documents based on contents thereof according to another aspect
of the present invention comprises an input unit for inputting
document data; a language analyzer unit for analyzing document
data input by the input unit and obtaining language analysis
information; a vector creation unit for creating document
characteristic vectors for the document data based on the
language analysis information obtained by the language analyzer
unit; a classification unit for classifying documents based on the
degree of similarity between document characteristic vectors
created by the vector creation unit, and creating clusters of
documents; a cluster characteristics calculation unit for
calculating cluster characteristics, which are
characteristics of clusters of documents created by the
classification unit; and a classification category memory unit
for storing cluster characteristics, calculated by the cluster
characteristics calculation unit, as constituent elements of
classification categories.

According to the above aspect of this invention, it is
possible to obtain clusters, and to structure and categorize
the clusters based on their contents using their degree of
similarity to the cluster center, and the like.
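
The chain of units in this aspect can be illustrated with the following
sketch. It is not the method of the specification: the word tokenizer standing
in for language analysis, the term-frequency vectors, the cosine similarity
measure, the greedy "leader" clustering with a fixed threshold, and all
function names are assumptions chosen only to make the data flow concrete.

```python
import math
import re
from collections import Counter

def analyze(text):
    """Stand-in for the language analyzer unit: lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def make_vector(tokens):
    """Vector creation unit: sparse term-frequency vector."""
    return Counter(tokens)

def cosine(u, v):
    """Degree of similarity between two sparse vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
        math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def classify(vectors, threshold=0.3):
    """Classification unit: join the most similar cluster or start a new one."""
    clusters = []  # each cluster: {"members": [doc indices], "center": Counter}
    for i, vec in enumerate(vectors):
        best = max(clusters, key=lambda c: cosine(vec, c["center"]), default=None)
        if best is not None and cosine(vec, best["center"]) >= threshold:
            best["members"].append(i)
            best["center"] = best["center"] + vec
        else:
            clusters.append({"members": [i], "center": Counter(vec)})
    return clusters

def cluster_characteristics(cluster, top_n=3):
    """Cluster characteristics calculation unit: most frequent terms."""
    return [term for term, _ in cluster["center"].most_common(top_n)]

docs = [
    "printer paper jam error",
    "paper jam in printer tray",
    "network connection timeout",
    "timeout on network login",
]
clusters = classify([make_vector(analyze(d)) for d in docs])
# Classification category memory unit, modelled as a dictionary.
categories = {i: cluster_characteristics(c) for i, c in enumerate(clusters)}
```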

A document classification device for classifying
documents based on contents thereof according to still another
aspect of the present invention comprises an input unit for
inputting document data; a language analyzer unit for analyzing
document data input by the input unit and obtaining language
analysis information; a vector creation unit for creating
document characteristic vectors for the document data based on
the language analysis information obtained by the language
analyzer unit; a classification unit for classifying documents
based on the degree of similarity between document
characteristic vectors created by the vector creation unit, and
creating clusters of documents; a cluster characteristics
calculation unit for calculating cluster characteristics,
which are characteristics of clusters of documents created by
the classification unit; a display unit for displaying the
cluster characteristics calculated by the cluster
characteristics calculation unit; a cluster selection
specification unit for selecting predetermined clusters from
the clusters of documents created by the classification unit; and
a classification category memory unit for storing cluster
characteristics, calculated by the cluster characteristics
calculation unit, as constituent elements of classification
categories.

According to the above aspect of this invention, only
selected clusters are used, making it possible to structure and
categorize the clusters in a manner closer to that desired by
the operator.

Further, the arrangement of the present invention
described above further comprises a document characteristic
vector memory unit for storing document characteristic vectors
created by the vector creation unit; and a vector correction unit
for correcting document characteristic vectors stored in the
document characteristic vector memory unit, so that document
characteristic vectors of documents belonging to clusters
selected by the cluster selection unit are deleted.
Furthermore, the classification unit classifies documents
based on the document characteristic vectors corrected by the
vector correction unit.

Hence the effects of clusters which are already known can 
be eliminated, and new clusters can be created. 
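
The vector correction described above amounts to removing the vectors of
documents in the selected clusters before the next classification pass. A
minimal sketch follows, in which the dictionary-based vector store, the
cluster layout, and the function name are hypothetical:

```python
def correct_vectors(vector_store, clusters, selected_cluster_ids):
    """Return a store without the vectors of documents in selected clusters."""
    known_docs = {
        doc_id
        for cluster_id in selected_cluster_ids
        for doc_id in clusters[cluster_id]
    }
    return {
        doc_id: vec
        for doc_id, vec in vector_store.items()
        if doc_id not in known_docs
    }

# Document characteristic vector memory unit (illustrative contents).
vector_store = {
    "d1": [1, 1, 0],
    "d2": [1, 0, 0],
    "d3": [0, 0, 1],
    "d4": [0, 1, 1],
}
# Clusters from a previous classification: cluster id -> document ids.
clusters = {"A": ["d1", "d2"], "B": ["d3"], "C": ["d4"]}

# The operator selected cluster "A" as already known.
corrected = correct_vectors(vector_store, clusters, ["A"])
```

The corrected store would then be passed to the classification unit again, so
that the next set of clusters is formed only from documents not yet explained.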

Further, the document classification device of the
present invention further comprises a document characteristic
vector memory unit for storing document characteristic vectors
created by the vector creation unit; and a document expression space
correction unit for correcting document expression space when
determining the degree of similarity between document
characteristic vectors stored in the document characteristic
vector memory unit, based on a characteristics amount
calculated from clusters selected by the cluster selection unit.
Furthermore, the classification unit classifies documents
based on the degree of similarity between document
characteristic vectors created by the vector creation unit,
using the document expression space corrected by the document
expression space correction unit.

Hence, cluster characteristics selected by the operator 
in the previous classification can be eliminated from the next 
classification, enabling new clusters to be created. 
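
This passage does not state how the document expression space is corrected.
One possible reading, shown here purely as an assumption, is to take the
centroid of the selected cluster as the "characteristics amount" and remove
that direction from every vector before similarity is computed. The function
names and the three-term vocabulary are hypothetical.

```python
import math

def centroid(vectors):
    """Mean vector of a cluster (assumed 'characteristics amount')."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def remove_direction(vec, direction):
    """Project vec onto the subspace orthogonal to direction."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        return list(vec)
    scale = sum(v * d for v, d in zip(vec, direction)) / norm_sq
    return [v - scale * d for v, d in zip(vec, direction)]

def cosine(u, v):
    """Degree of similarity in the (possibly corrected) space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Axes: [printer, jam, network] (illustrative vocabulary).
selected_cluster = [[1, 1, 0], [1, 1, 0]]   # cluster chosen by the operator
doc_a = [1, 1, 1]
doc_b = [0, 0, 1]

known = centroid(selected_cluster)
before = cosine(doc_a, doc_b)
after = cosine(remove_direction(doc_a, known), remove_direction(doc_b, known))
```

In this toy case the two documents look more alike once the already known
"printer jam" direction is removed, which is the kind of effect the passage
describes.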

Further, the document classification device according to
the present invention further comprises a document
characteristic vector memory unit for storing document
characteristic vectors created by the vector creation unit; and a
document expression space correction unit for correcting
document expression space when determining the degree of
similarity between document characteristic vectors stored in
the document characteristic vector memory unit, based on a
characteristics amount calculated from clusters selected by the
cluster selection unit. Furthermore, the classification unit
classifies documents based on the degree of similarity between
document characteristic vectors created by the vector creation
unit, using the document expression space corrected by the
document expression space correction unit.

Hence influences of the known cluster can be eliminated
and cluster characteristics selected by the operator in the
previous classification can be eliminated from the next
classification, enabling new clusters to be created.

Further, the document classification device of the
present invention further comprises a selection information
appending unit for appending selection information showing the
fact of selection when all or part of the documents belonging
to a cluster of documents created by the classification unit
have been selected. Furthermore, the display unit displays the
cluster characteristics, and also displays the selection
information appended by the selection information appending
unit.

Hence it is possible to improve the ability to identify 
documents used on multiple occasions, and the ability to 
identify documents which have not been selected at all. 

Further, the classification category memory unit stores
cluster characteristics and/or information created by an
operator, in addition to all or part of the documents belonging
to a cluster of documents selected by the selection
specification unit, as constituent elements of classification
categories.



Hence the contents of clusters can be easily recognized, 
and in addition, the operator can easily create his own 
classification categories, thereby improving the usefulness of 
the classification categories. 

A document classification device for classifying
document clusters in accordance with contents thereof according
to still another aspect of the present invention comprises a
document input unit for inputting document data groups; a
document dividing unit for dividing document data into one or
multiple divided document data based on a predetermined
reference; a document-divided document map creation unit for
creating a map showing the correspondence between the document
data and the divided document data; a divided document
classification unit for classifying the divided document data;
a divided document classification result creation unit for
creating divided document classification result information
based on a classification result of the divided document
classification unit; and a document classification result
creation unit for creating classification result information
of the above document data using the document-divided document
map and the divided document classification result information.

According to the above aspect of this invention, when one
document contains multiple topics and meanings, these can be
classified into categories according to specific topics and
meanings, so that the classifications do not differ from
categories desired by a user, thereby enabling the user to
easily comprehend the classification categories. Furthermore,
since the positions of the divided documents in documents prior
to division (documents belonging to the clusters) are displayed,
the user is able to efficiently read the parts of the document
clusters he or she wishes to read.
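
The relationship between the document data, the divided document data, and the
document-divided document map might look like the sketch below. The
blank-line split used as the "predetermined reference", the keyword matcher
standing in for the divided document classification unit, and all names are
assumptions for illustration only.

```python
def divide(text):
    """Document dividing unit: split on blank lines (assumed reference)."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]

def build_map(documents):
    """Create divided documents and the document-divided document map."""
    divided, doc_map = {}, {}
    for doc_id, text in documents.items():
        for position, part in enumerate(divide(text)):
            part_id = f"{doc_id}#{position}"
            divided[part_id] = part
            doc_map[part_id] = (doc_id, position)
    return divided, doc_map

def classify_part(text, keywords):
    """Stand-in classifier: first category whose keywords appear in the text."""
    words = set(text.lower().split())
    for category, terms in keywords.items():
        if words & terms:
            return category
    return "other"

def document_results(divided_results, doc_map):
    """Map divided-document categories back to documents and positions."""
    results = {}
    for part_id, category in divided_results.items():
        doc_id, position = doc_map[part_id]
        results.setdefault(doc_id, {}).setdefault(category, []).append(position)
    return results

documents = {
    "d1": "The printer jams often.\n\nThe network drops at night.",
    "d2": "Network login fails.",
}
keywords = {"printer": {"printer"}, "network": {"network"}}

divided, doc_map = build_map(documents)
divided_results = {pid: classify_part(t, keywords) for pid, t in divided.items()}
results = document_results(divided_results, doc_map)
```

Because the map records the position of each divided document, a document
covering two topics can appear under both categories together with the place
where each topic occurs.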

Further, the document classification device further
comprises a document save unit for saving the document data;
a divided document save unit for saving the divided document
data; and a document-divided document map save unit for saving
a document-divided document map created by the document-divided
document map creation unit.

Hence for a single document data, it is possible to
efficiently determine classification results having different
parameters such as the number of classifications, the
classification method, and the settings used in the
classifications, without recreating the divided document data
and the document-divided document map. Furthermore, by
classifying the document data and saving the data needed to
create the classification result, the user is free to take more
time over the classification, and to re-analyze previously
classified documents within a given period of time.

Further, the document classification device in the
specific arrangement described above further comprises a
divided document classification result save unit for saving
divided document classification result information created by
the divided document classification result creation unit.

Hence, an additional effect can be achieved: after one
classification has been carried out, the result of that
classification can be expressed in a variety of formats such
as text, charts, graphs, and the like.
Furthermore, by saving the divided document classification
result information, the user is free to take more time over
classifications and analysis of classification results, and to
re-analyze previously classified documents in a variety of
formats within a given period of time.

Further, the multiple divided document data created by 
the document dividing unit contains the document data in its 
state prior to being divided. 

Hence in addition to a classification structure of
detailed document data, obtained by classifying the divided
document data, the user can obtain a classification structure
that incorporates schematic macro classifications obtained by
classifying the document data itself prior to division.

Further, the document dividing unit divides document data
based on information relating to the structure of the document
data.

Hence division and the like of different topics can be
carried out, whereby documents can be classified in such a
manner that the detailed classification structures of their
document data can be known.

Further, the document classification device further
comprises a document element extraction unit for extracting
elements in the document data; and an element-accompanying
information extraction unit for extracting element-accompanying
information accompanying the elements extracted
by the document element extraction unit. Furthermore, the
document dividing unit divides the document data using elements
extracted by the document element extraction unit, or the
elements and element-accompanying information extracted by the
element-accompanying information extraction unit.

Hence documents can be classified so that the detailed 
classification structure of the document data can be known. 

Further, the document dividing unit divides document data
in accordance with a specified range.

Hence documents can be classified in accordance with the 
wishes of the user, and so that the detailed classification 
structure of the document data can be known. 

Further, the document dividing unit divides document data
based on the number of letters, the number of sentences, or both
the number of letters and the number of sentences.

Hence there is an increased capability to classify
different documents having contents of different topics and the
like. Therefore, as above, documents can be classified so that
the detailed classification structure of the document data can
be known.
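
Division by the number of letters, the number of sentences, or both could be
sketched as follows. The sentence splitter based on terminal punctuation, the
greedy grouping strategy, and the parameter names are assumptions, since the
passage does not specify them.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (assumed rule)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def divide_by_size(text, max_sentences=None, max_letters=None):
    """Group consecutive sentences so each chunk respects the given limits."""
    chunks, current = [], []
    for sentence in split_sentences(text):
        candidate = current + [sentence]
        too_many = max_sentences is not None and len(candidate) > max_sentences
        too_long = max_letters is not None and len(" ".join(candidate)) > max_letters
        if current and (too_many or too_long):
            # Close the current chunk and start a new one with this sentence.
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "One. Two. Three. Four. Five."
by_sentences = divide_by_size(text, max_sentences=2)
by_letters = divide_by_size(text, max_letters=10)
```

A single sentence longer than `max_letters` is kept whole in this sketch
rather than being cut in the middle.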

Further, the document classification result creation
unit extracts and presents information showing document data,
and representative information accompanying the document data,
as classification result information.

Hence the user is able to determine a detailed schematic 
structure or overall structure of the document data. 

Further, the document classification result creation
unit extracts and presents information showing divided document
data, and representative information accompanying the divided
document data, as classification result information.

Hence the user is able to determine a detailed schematic
structure or overall structure of the document data. In
addition, the user can easily determine which divided document
has been classified in a given category.

A document processing method according to still another
aspect of the present invention outputs multiple input document
data in order to display or print the document data in a
predetermined format, and comprises the steps of storing input
document data; selecting all or part of the document data stored
in the document memory unit; extracting data relating to
characteristics of letter rows from all or part of the document
data selected by the selection unit; work-processing all or part
of the document data based on the data relating to
characteristics of letter rows extracted by the characteristics
extraction unit; and outputting all or part of the document data
work-processed by the work processing unit.

According to the above aspect of this invention, when
analyzing documents according to their meanings, rather than
merely outputting the result of the analysis, the entire
information analysis operation can be supported.

Further, the step of outputting comprises the steps of
setting a plurality of item values based on the contents of all
or part of the document data work-processed by the
work-processing unit; and totalizing all or part of the document data
for each item value set by the item value set unit; and outputs
all or part of the document data in the format of a table having
an item value as at least one axis.

Hence the result of the work-processing can easily be
expressed in a cross table, and the contents of the information
can easily be grasped. Therefore, when analyzing documents
according to their meanings, rather than merely outputting the
result of the analysis, the entire information analysis
operation can be supported.

Further, the step of outputting further comprises
outputting all or part of the document data work-processed by
the work processing unit together with all or part of the
document data in its state prior to work-processing by the work
processing unit.

Hence the data to be work-processed and other data can
be displayed simultaneously and identified, whereby the range
of the work-processing to be carried out can be accurately and
easily determined. Therefore, when analyzing documents
according to their meanings, rather than merely outputting the
result of the analysis, the entire information analysis
operation can be supported.

Further, the step of storing further comprises storing 
all or part of the document data work-processed by the work 
processing unit. 

Since other data can be handled simultaneously, when
thereafter analyzing documents according to their meanings,
rather than merely outputting the result of the analysis, the
entire information analysis operation can be supported.

Further, the step of selecting further comprises
selecting all or part of the document data output by the output
unit.

Since all or part of the document data output by the output
unit can be selected for analysis, a wide variety of information
can be analyzed with high precision. Therefore, when analyzing
documents according to their meanings, rather than merely
outputting the result of the analysis, the entire information
analysis operation can be supported.

Further, the step of storing a document further comprises
storing data relating to contents of the work processing.

Hence not only can loss of data relating to the contents
of work-processing be prevented and the data managed easily,
but also the relationship between settings used in the
work-processing and the processed result can be determined.
Therefore, when analyzing documents according to their meanings,
rather than merely outputting the result of the analysis, the
entire information analysis operation can be supported.

A document classification method for classifying
documents based on contents thereof according to still another
aspect of the present invention comprises the steps of inputting
document data; language-analyzing document data input in the
step of inputting and obtaining language analysis information;
creating document characteristic vectors for the document data
based on the language analysis information obtained in the step
of language-analyzing; classifying documents based on the
degree of similarity between document characteristic vectors
created in the step of creating vectors, and creating clusters
of documents; calculating cluster characteristics, being
characteristics of clusters of documents created in the step
of classifying; and storing cluster characteristics,
calculated in the step of calculating cluster characteristics,
as constituent elements of classification categories.

According to the above aspect of this invention, it is
possible to obtain clusters, and to structure and categorize
the clusters based on their contents using their degree of
similarity to the cluster center, and the like.



A document classification method for classifying
documents based on contents thereof according to still another
aspect of the present invention comprises the steps of inputting
document data; language-analyzing document data input in the
step of inputting and obtaining language analysis information;
creating document characteristic vectors for the document data
based on the language analysis information obtained in the step
of language-analyzing; classifying documents based on the
degree of similarity between document characteristic vectors
created in the step of creating vectors, and creating clusters
of documents; calculating cluster characteristics, which are
characteristics of clusters of documents created in the step
of classifying; displaying the cluster characteristics
calculated in the step of calculating cluster characteristics;
selecting predetermined clusters from the clusters of documents
created in the step of classifying; and storing cluster
characteristics, calculated in the step of calculating cluster
characteristics, as constituent elements of classification
categories.

According to the above aspect of this invention, only
selected clusters are used, making it possible to structure and
categorize the clusters in a manner closer to that desired by
the operator.

Further, the document classification method also
comprises a step of correcting document characteristic vectors
stored in the step of storing document characteristic vectors,
so that document characteristic vectors of documents belonging
to clusters selected by the step of selecting clusters are
deleted. Furthermore, the step of classifying comprises
classifying documents based on the document characteristic
vectors corrected by the step of correcting vectors.

Hence the effects of clusters which are already known can 
be eliminated, and new clusters can be created. 
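
As a minimal sketch of this correction, assuming the stored vectors are keyed by a document ID and each cluster is a list of document IDs (both layouts, and the function name, are hypothetical), deleting the vectors of already-known clusters before the next classification might look like this:

```python
def drop_known_documents(vectors, clusters, selected_cluster_ids):
    # Collect the documents that belong to the clusters the operator selected.
    known = {doc_id for cid in selected_cluster_ids for doc_id in clusters[cid]}
    # Keep only the vectors of documents outside those clusters.
    return {doc_id: vec for doc_id, vec in vectors.items() if doc_id not in known}

vectors = {1: [2, 0], 2: [3, 0], 3: [0, 1], 4: [0, 4]}  # doc_id -> vector
clusters = {"A": [1, 2], "B": [3, 4]}                   # cluster_id -> doc_ids
remaining = drop_known_documents(vectors, clusters, ["A"])
```

The next classification would then run on `remaining` only, so the already-known cluster "A" cannot dominate the result.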

Further, the document classification method also
comprises a step of correcting document expression space when
determining the degree of similarity between document
characteristic vectors stored in the step of storing document
characteristic vectors, based on a characteristics amount
calculated from clusters selected in the step of selecting
clusters, and the step of classifying comprises classifying
documents based on the degree of similarity between document
characteristic vectors created in the step of creating vectors,
using the document expression space corrected in the step of
correcting the document expression space.

Hence cluster characteristics selected by the operator
in the previous classification can be eliminated from the next
classification, enabling new clusters to be created.
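
One possible way to picture such a correction of the document expression space is sketched below. It is an assumption made for illustration only: the text does not say how the space is corrected, and here the characteristics amount is taken to be the centroid of the selected cluster, whose direction is projected out of every vector before similarity is computed.

```python
def centroid(vectors):
    # Characteristics amount of a selected cluster (assumption): its mean vector.
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def project_out(v, direction):
    # Remove from v the component lying along the given direction.
    norm_sq = sum(x * x for x in direction)
    if norm_sq == 0:
        return list(v)
    scale = sum(a * b for a, b in zip(v, direction)) / norm_sq
    return [a - scale * b for a, b in zip(v, direction)]

selected_cluster = [[2.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
known_direction = centroid(selected_cluster)
corrected = project_out([1.0, 2.0, 0.0], known_direction)
```

After the projection, a document no longer scores as similar to others merely because it shares the characteristic of the previously selected cluster.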

Further, the document classification method also
comprises a step of correcting document expression space
when determining the degree of similarity between document
characteristic vectors stored in the step of storing document
characteristic vectors, based on a characteristics amount
calculated from clusters selected in the step of selecting
clusters. Furthermore, the step of classifying comprises
classifying documents based on the degree of similarity between
document characteristic vectors created in the step of creating
vectors, using the document expression space corrected in the
step of correcting the document expression space.

Hence influences of the known cluster can be eliminated
and cluster characteristics selected by the operator in the
previous classification can be eliminated from the next
classification, enabling new clusters to be created.

Further, the document classification method also
comprises a step of appending selection information showing
the fact of selection when all or part of the documents belonging
to a cluster of documents created in the step of classifying
have been selected. Furthermore, the step of displaying
comprises displaying the cluster characteristics, and
displaying the selection information appended in the step of
appending selection information.

Hence it is possible to improve the ability to identify 
documents used on multiple occasions, and the ability to 
identify documents which have not been selected at all. 

Further, the step of creating classification categories
comprises creating cluster characteristics and/or information
created by an operator, in addition to all or part of the
documents belonging to a cluster of documents selected in the
step of specifying selection, as constituent elements of
classification categories.
Hence the contents of clusters can be easily recognized,
and in addition, the operator can easily create his own
classification categories, thereby improving the usefulness of
the classification categories.

A document classification method for classifying
document clusters in accordance with contents thereof according
to still another aspect of the present invention comprises the
steps of inputting document data groups; dividing document data
into one or multiple divided document data based on a
predetermined reference; creating a map showing the
correspondence between the document data and the divided
document data; classifying the divided document data; creating
divided document classification result information based on the
classification result of classifying the divided documents; and
creating classification result information of the document data
using the document-divided document map and the divided
document classification result information.

According to the above aspect of this invention, when one
document contains multiple topics and meanings, these can be
classified into categories according to specific topics and
meanings, so that the classifications do not differ from
categories desired by a user, thereby enabling the user to
easily comprehend the classification categories. Furthermore,
since the positions of the divided documents in documents prior
to division (documents belonging to the clusters) are displayed,
the user is able to efficiently read the parts of the document
clusters he or she wishes to read.
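
The division, mapping, and recombination steps can be illustrated with the following sketch. The choice of blank-line paragraph boundaries as the "predetermined reference", the keyword rule standing in for the classifier, and all names are assumptions made only for this example.

```python
def divide(doc_id, text):
    # Predetermined reference (assumption): split at blank lines.
    parts = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [(f"{doc_id}#{i}", part) for i, part in enumerate(parts)]

def build_map(documents):
    # Map from each divided document back to its source document.
    seg_to_doc, segments = {}, {}
    for doc_id, text in documents.items():
        for seg_id, part in divide(doc_id, text):
            seg_to_doc[seg_id] = doc_id
            segments[seg_id] = part
    return seg_to_doc, segments

def classify_segment(text):
    # Hypothetical stand-in for classifying one divided document.
    return "engine" if "engine" in text else "body"

def document_results(seg_to_doc, segments):
    # Combine the divided-document results through the map.
    result = {}
    for seg_id, text in segments.items():
        result.setdefault(seg_to_doc[seg_id], set()).add(classify_segment(text))
    return result

documents = {
    "d1": "The engine stalls.\n\nThe door rattles.",
    "d2": "The engine overheats.",
}
seg_to_doc, segments = build_map(documents)
results = document_results(seg_to_doc, segments)
```

Because the segment IDs keep the position of each part within its source document, a display could also point the user to the exact part of "d1" that fell into each category.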

A computer-readable recording medium of still another
aspect of the present invention stores programs for executing
the above-described document classification method on a
computer, thereby making the program machine-readable, and
enabling the operation of the document classification method
to be executed by a computer.

Other objects and features of this invention will become
understood from the following description with reference to the
accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a diagram explaining the entire hardware
constitution of a data processing system comprising a document
processor according to a first embodiment of the present
invention;

Fig. 2 is a diagram explaining the hardware constitution
of a server in a data processing system comprising the document
processor according to the first embodiment of the present
invention;



Fig. 3 is a diagram explaining the hardware constitution
of a client in a data processing system comprising the document
processor according to the first embodiment of the present
invention;

Fig. 4 is a block diagram functionally showing a
constitution of the document processor according to the first
embodiment of the present invention;

Fig. 5 is a diagram explaining the relationship between
item names of the document processor according to the first
embodiment of the present invention;

Fig. 6 is a diagram explaining a data structure of a
document stored in a document memory of the document processor
according to the first embodiment of the present invention;

Fig. 7 is a diagram explaining another data structure of
a document stored in a document memory of the document processor
according to the first embodiment of the present invention;

Fig. 8 is a diagram explaining an example of a screen
display in an output section of a document processor according
to an embodiment of the present invention;

Fig. 9 is a diagram explaining another example of a screen
display of an output section of a document processor according
to an embodiment of the present invention;

Fig. 10 is a diagram explaining another example of a
screen display of an output section of a document processor
according to an embodiment of the present invention;



Fig. 11 is a diagram explaining a list of contents of
extraction processing performed by a characteristics extractor
of a document processor according to the first embodiment of
the present invention;

Fig. 12 is a diagram explaining a list of contents of work
processing performed by a work processor of a document processor
according to the first embodiment of the present invention;

Fig. 13 is a diagram explaining characteristic vectors
of each item of a document processor according to the first
embodiment of the present invention;

Fig. 14 is a diagram explaining words, and the number of
appearances of each word ID, of a document processor according
to the first embodiment of the present invention;

Fig. 15 is a diagram explaining another screen display
of an output section of a document processor according to the
first embodiment of the present invention;

Fig. 16 is a diagram explaining a command screen for
creating a cross table in an output section of a document
processor according to the first embodiment of the present
invention;

Fig. 17 is a diagram explaining a cross table displaying
a result of classification processing by an output section of
a document processor according to the first embodiment of the
present invention;
Fig. 18 is a diagram explaining another cross table
displaying a result of classification processing by an output
section of a document processor according to the first
embodiment of the present invention;

Fig. 19 is a block diagram showing a detailed constitution
of an output section of a document processor according to the
first embodiment of the present invention;

Fig. 20 is a flowchart showing an output sequence of a
cross table of a document processor according to the first
embodiment of the present invention;

Fig. 21 is a diagram explaining another screen display
of an output section of a document processor according to the
first embodiment of the present invention;

Fig. 22 is a diagram explaining another screen display
of an output section of a document processor according to the
first embodiment of the present invention;

Fig. 23 is a diagram explaining another screen display
of an output section of a document processor according to the
first embodiment of the present invention;

Fig. 24 is a diagram explaining another screen display
of an output section of a document processor according to the
first embodiment of the present invention;

Fig. 25 is a block diagram showing a detailed constitution
of document memory of a document processor according to the
first embodiment of the present invention;
Fig. 26 is a diagram explaining another screen display
of an output section of a document processor according to the
first embodiment of the present invention;

Fig. 27 is a diagram explaining another screen display
of an output section of a document processor according to the
first embodiment of the present invention;

Fig. 28 is a diagram explaining another screen display
of an output section of a document processor according to the
first embodiment of the present invention;

Fig. 29 is a flowchart showing a sequence of document
processing in a document processor according to the first
embodiment of the present invention;

Fig. 30 is a block diagram functionally showing a
constitution of a document classification device according to
a second embodiment of the present invention;

Fig. 31 is a diagram explaining an example of a display
of a cluster characteristics display section in a document
classification device according to the second embodiment of the
present invention;

Fig. 32 is a flowchart showing a sequence of processing
in a document classification device according to the second
embodiment of the present invention;

Fig. 33 is a block diagram functionally showing a
constitution of a document classification device according to
a third embodiment of the present invention;
Fig. 34 is a flowchart showing a sequence of processing
in a document classification device according to the third
embodiment of the present invention;

Fig. 35 is a block diagram functionally showing a
constitution of a document classification device according to
a fourth embodiment of the present invention;

Fig. 36 is a flowchart showing a sequence of processing
in a document classification device according to the fourth
embodiment of the present invention;

Fig. 37 is a block diagram functionally showing a
constitution of a document classification device according to
a fifth embodiment of the present invention;

Fig. 38 is a flowchart showing a sequence of processing
in a document classification device according to the fifth
embodiment of the present invention;

Fig. 39 is a block diagram functionally showing a
constitution of a document classification device according to
a sixth embodiment of the present invention;

Fig. 40 is a diagram explaining a table provided in a
classification result memory of a document classification
device according to the sixth embodiment of the present
invention;

Fig. 41 is a flowchart showing a processing sequence of
a selection information append section of a document
classification device according to the sixth embodiment of the
present invention;



Fig. 42 is a block diagram showing a constitution of a
document classification device according to a seventh
embodiment of the present invention;

Fig. 43 is a diagram explaining a document classification
device and a document classification method according to the
seventh embodiment of the present invention;

Fig. 44 is another diagram explaining a document
classification device and a document classification method
according to the seventh embodiment of the present invention;

Fig. 45 is another diagram explaining a document
classification device and a document classification method
according to the seventh embodiment of the present invention;

Fig. 46 is another diagram explaining a document
classification device and a document classification method
according to the seventh embodiment of the present invention;

Fig. 47 is a block diagram showing a constitution of a
document classification device according to an eighth
embodiment of the present invention;

Fig. 48 is a block diagram showing a constitution of a
document classification device according to a ninth embodiment
of the present invention;

Fig. 49 is a diagram explaining a document classification
device and a document classification method according to a tenth
embodiment of the present invention;
Fig. 50 is a diagram explaining a document classification
device and a document classification method according to an
eleventh embodiment of the present invention;

Fig. 51 is a block diagram showing a constitution of a
document classification device according to a twelfth
embodiment of the present invention;

Fig. 52 is a diagram explaining a document classification
device and a document classification method according to the
twelfth embodiment of the present invention;

Fig. 53 is a diagram explaining a document classification
device and a document classification method according to a
thirteenth embodiment of the present invention;

Fig. 54 is a diagram explaining a document classification
device and a document classification method according to a
fourteenth embodiment of the present invention;

Fig. 55 is a diagram explaining a document classification
device and a document classification method according to a
fifteenth embodiment of the present invention; and

Fig. 56 is a diagram explaining a document classification
device and a document classification method according to a
sixteenth embodiment of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS 

Preferred embodiments of a document processor, a document
processing method, and a computer-readable recording medium for
recording a program to execute the method on a computer
according to the present invention will be described below with
reference to the accompanying drawings.

To begin with, the hardware constitution of an entire data
processing system comprising a document processor according to
a first embodiment of the present invention will be explained.
Fig. 1 is a diagram explaining the hardware constitution of an
entire data processing system comprising a document processor
according to the first embodiment of the present invention.

As shown in Fig. 1, a data processing system comprising
the document processor according to the first embodiment
is configured as a server/client system. That is, a server 101
and multiple clients 102 are connected via a network 103. The
clients 102 create work data such as classification data, send
this to the server 101, and display the results of work
processing such as classification, and the like. On the other
hand, in compliance with specifications from the clients 102,
the server 101 carries out vast numerical calculations to
perform work processing such as document (text) classification,
and sends the results of the processing to the clients 102.

More specifically, when performing classification
processing, the server 101 classifies a text (pre-processing,
clustering) and the clients 102 create classification data,
program execution commands, tables of text classification
results, and such like. As described above, the processing at
the server 101 is divided into two types, "pre-processing" and
"classification", and the burden of this processing can be
extremely heavy when there is a vast amount of data. Therefore,
a manager process creates a processing receive list and controls
the processing, so that "pre-processing" and "classification"
in the server 101 are only performed once each.

Furthermore, data is exchanged between the server 101 and
the clients 102 by a method termed joint filing. That is, a
file used in processing such as classification is created in
a joint folder on the server 101, enabling both sides to exchange
the data. Therefore, the clients 102 can use the joint folder
of the server 101 via the network 103.

The constitution of the hardware of the server 101 and
the clients 102 will be explained below. Fig. 2 is a diagram
explaining a hardware constitution of the server 101 in the data
processing system comprising the document processor according
to the first embodiment. A work station (WS) is, for example,
used as the server 101.

In Fig. 2, reference symbol 201 represents a CPU for
controlling the entire server 101, reference symbol 202
represents a ROM which stores boot programs and the like,
reference symbol 203 represents a RAM used as a work area of the
CPU 201, reference symbol 204 represents an interface (I/F),
which is connected to the network 103 via a communications line
205 and controls the network 103 and an internal interface, and
reference symbol 206 represents a disk device for storing data.
The reference symbol 200 represents a bus for coupling the above
parts.

In addition, a display 208 for displaying document
information, image information, function information, and the
like, a keyboard 209 for inputting data, and a mouse 210 and
the like, may similarly be connected. Moreover, the disk device
206 comprises a joint folder 207 for exchanging data with the
clients 102.

Furthermore, Fig. 3 is a diagram explaining a hardware
constitution of a client 102 in a data processing system
comprising the document processor according to the first
embodiment. A personal computer (PC) is, for example, used as
the client 102.

In Fig. 3, reference symbol 301 represents a CPU for
controlling the entire system, reference symbol 302 represents
a ROM which stores boot programs and the like, reference symbol
303 represents a RAM used as a work area of the CPU 301, reference
symbol 304 represents an HDD (hard disk drive) for controlling
reading and writing of data to an HD (hard disk) 305 in compliance
with the CPU 301, reference symbol 305 represents an HD for
storing data written in compliance with the HDD 304, reference
symbol 306 represents an FDD (floppy disk drive) for controlling
reading and writing of data to an FD (floppy disk) 307 in
compliance with the CPU 301, reference symbol 307 represents
a freely attachable and detachable FD for storing data written
in compliance with the FDD 306, and reference symbol 308
represents a display for displaying documents, images, function
data, etc.

Furthermore, reference symbol 309 represents an
interface (I/F), which is connected to the network 103 via a
communications line 310 and controls the network 103 and the
internal interface, reference symbol 311 represents a keyboard
comprising keys for inputting letters, numbers, a variety of
commands, and the like, reference symbol 312 represents a mouse
for moving a cursor and selecting a range, or pressing icons
and buttons displayed on a display screen, moving windows and
changing their sizes, and the like, reference symbol 313
represents a scanner, having an OCR (optical character reader)
function, for optically reading images, reference symbol 314
represents a printer for printing contents and the like of data
comprising classification results, and reference symbol 315
represents a bus for joining all the above parts. Furthermore,
application software 316, such as word processing software,
is stored in the HD 305.

The functional constitution of the document processor
according to the first embodiment will be explained here. Fig.
4 is a block diagram functionally showing a constitution of the
document processor according to the first embodiment of the
present invention. In Fig. 4, the document processor comprises
an input section 401, a document memory 402, a selector 403,
a characteristics extractor 404, a work processor 405, and an
output section 406.

The input section 401, the document memory 402, the
selector 403, the characteristics extractor 404, the work
processor 405, and the output section 406, are controlled by
CPU 201 and CPU 301 and the like, which execute processing in
compliance with commands contained in programs recorded in
recording media such as a ROM 202 and 302, a RAM 203 and 303,
or a disk device 306 and a hard disk 316, etc.

The input section 401 inputs document data, and comprises,
for example, the I/F 204 or 309, or the like, capable of
obtaining documents and groups of documents via a keyboard 209
or 311, a scanner 313 comprising an OCR function, and a network
103. Furthermore, in addition to the above, anything capable of
extracting document data may serve as the input section 401.
For example, when the document data is saved in a database, and
the medium in which the database is stored is provided in the
document processor of the first embodiment, document data is
input from that database.

A document is a collection of one or more sentences
written in a natural language, comprising letters, rows of
letters, numbers, and the like, which are organized into a
meaningful arrangement to form one document. Furthermore, a
collection of multiple documents is termed a document cluster.

A document comprises one or multiple items. An item
comprises an item name and an item value. An item name is a label
showing the contents of the item, and may or may not be included
in the document. An item value is the actual content of the
item. Fig. 5 is a diagram explaining the relationship between
an item name and an item value in the document processor
according to the first embodiment. Fig. 5 shows an example in
which one patent specification forms one document, and the
patent specification is expressed using an item name and an item
value.
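
A document made of items can be modelled along the following lines. This is a minimal sketch; the class names, the `value_of` helper, and the sample item names and values are hypothetical and are not taken from Fig. 5.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Item:
    value: str                  # item value: the actual content of the item
    name: Optional[str] = None  # item name: a label that may be absent

@dataclass
class Document:
    doc_id: int
    items: List[Item] = field(default_factory=list)

    def value_of(self, name: str) -> Optional[str]:
        # Return the value of the first item carrying the given name.
        for item in self.items:
            if item.name == name:
                return item.value
        return None

# Hypothetical patent specification expressed as named and unnamed items.
spec = Document(doc_id=1, items=[
    Item(name="title", value="Document processor"),
    Item(name="abstract", value="A processor for displaying document data."),
    Item(value="Text that carries no item name."),
])
```

Keeping the name optional reflects the statement that an item name may or may not be included in the document.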

A unique document ID is appended to each document and each
document in the document clusters obtained by the input section
401, and these are stored in the document memory 402. Fig. 6
is a diagram explaining the structure of document data stored
in the document memory 402 of the document processor according
to the first embodiment. Each of the item names and item values
is saved in one memory unit, that is, in one cell of the document
memory 402.

In Fig. 6, one cell comprises three memory regions, and
the position (number) of the next cell in the document memory
402 is stored in the first memory region 601. The generic value
of the cell is stored in the second memory region 602.

The generic values of the cells can, for example, be set
such that "0" signifies "empty", "1" signifies "numerical
value", "2" signifies "letter row", and so on. The content of
the cell, that is, the head position of the region in which the
item name or the item value and the like are stored, is stored
in the third memory region 603.

Rearrangement of the cell sequence, and addition and
deletion of cells, can easily be performed by changing the
position of the next cell stored in the first memory region 601.
Furthermore, since the actual content of the cell is stored in
a different region in the cell structure, when an item has been
updated and can no longer be held in a region reserved in advance,
for example, it is only necessary to reserve another large
region in which to store the item, with no effect on the structure
of the cell itself, and to update the head position stored in
the third memory region 603.
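
The three-region cell described above behaves like a linked list whose nodes point to separately stored contents. The sketch below illustrates that idea under stated assumptions: Python lists stand in for the document memory, list indexes stand in for positions, and the class and method names are hypothetical.

```python
EMPTY, NUMBER, STRING = 0, 1, 2  # generic values as in the example above

class CellStore:
    def __init__(self):
        self.cells = []     # each cell: [next cell position, generic value, content position]
        self.contents = []  # contents are kept apart from the cells

    def new_cell(self, value):
        kind = NUMBER if isinstance(value, (int, float)) else STRING
        self.contents.append(value)
        self.cells.append([None, kind, len(self.contents) - 1])
        return len(self.cells) - 1

    def insert_after(self, prev, value):
        # Adding a cell only rewrites "next" positions (first memory region).
        idx = self.new_cell(value)
        self.cells[idx][0] = self.cells[prev][0]
        self.cells[prev][0] = idx
        return idx

    def update(self, idx, value):
        # A larger value is stored elsewhere; only the head position
        # (third memory region) changes. The generic value is left as-is.
        self.contents.append(value)
        self.cells[idx][2] = len(self.contents) - 1

    def walk(self, head):
        out = []
        while head is not None:
            out.append(self.contents[self.cells[head][2]])
            head = self.cells[head][0]
        return out

store = CellStore()
head = store.new_cell("vehicle type")
tail = store.insert_after(head, "sedan")
store.insert_after(head, 1998)
store.update(tail, "four-door sedan")
```

Neither the insertion nor the update moves any existing cell, which is the property the paragraph attributes to this layout.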

Fig. 7 is a diagram explaining another data structure of
a document stored in the document memory 402 of the document
processor according to the first embodiment. In Fig. 7, one
cell uses two memory regions. The generic value of the cell
is stored in a first memory region 701. The content of the cell,
that is, the head position of the region in which the item name
or the item value and the like are stored, is stored in a
second memory region 702.

The next cell is stored in the next memory region adjacent 
in the document memory 402. With this data structure, a 
movement operation within the memory is required when cells have 
been rearranged, added, or deleted. 

The document memory 402 usually comprises a semiconductor
memory for handling data at high speed, but may include an
auxiliary memory device comprising a magnetic disk, an optical
disk, or the like.

Documents and document clusters stored in the document
memory 402 are displayed by the output section 406. In the first
embodiment, the output section 406 comprises a CRT display, a
liquid crystal display, or the like. The output section 406
reads out the contents of documents and document clusters stored
in the document memory 402 in the cell sequence, and displays
or prints them in table format.

Furthermore, the output section 406 may also comprise a
graph drawer 407 for drawing graphs based on the data displayed
or printed in table format. The graph drawer 407 reads out
contents of a region set by the user with respect to item values
of a document or a cluster of documents stored in the document
memory 402, draws graphs such as bar graphs, pie charts, regular
line graphs, and the like, and displays and prints them.

The output section 406 also displays operations of the
input section 401, by for example displaying operation menus,
mouse pointers, cursor displays, and the like. Furthermore,
the output section 406 may also comprise a printing device such
as a printer for printing the results of processing.

In compliance with a command input by the operator to the
input section 401, the selector 403 reads out, from the document
memory 402, data in a region selected on the display of the
output section 406, and sends it to the characteristics
extractor 404. The method by which the selector 403 makes its
selection will be explained using Figs. 8 to 10.

Figs. 8 to 10 are diagrams explaining examples of screen
displays of the output section 406 of the document processor
according to the first embodiment. More specifically, the
diagrams show examples of screen displays listing types of
vehicle malfunctions. In Fig. 8, the display screen displays
a "numbers" column 801 showing document ID numbers, a "date
received" column 802 showing the date on which the malfunction
information was received, a "sales office" column 803 showing
the sales office where the malfunction information was received,
a "vehicle type" column 804 showing the type of vehicle to which
the information refers, a "year" column 805 showing the year
of the vehicle to which the information refers, and a "contents"
column 806 showing the content of the malfunction information.

In Fig. 9, a selected region 901 is the portion displayed
within the rectangle and altered in color. Similarly, in Fig.
10, the selected region 1001 is the portion displayed within
the rectangle and altered in color.

The region selected by the selector 403 may be one part
of a column displayed on the screen as shown in Fig. 9, or, when
an item name is selected as shown in Fig. 10, all the item values
belonging to that item name may be selected. In the first
embodiment, only regions belonging to letter rows can be
selected.
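
The two ways of selecting a region can be sketched as follows, assuming the table is held as a list of rows keyed by item name. The function name, its parameters, and the sample rows are hypothetical; only the column names follow Fig. 8 as described above.

```python
def select(rows, item_name, start=None, stop=None):
    # start/stop pick part of a column; leaving both as None selects
    # every item value belonging to the item name.
    chosen = rows[start:stop]
    # Only letter rows (strings) are selectable in the first embodiment.
    return [r[item_name] for r in chosen if isinstance(r.get(item_name), str)]

rows = [
    {"numbers": 1, "vehicle type": "sedan", "year": 1997,
     "contents": "engine stalls when cold"},
    {"numbers": 2, "vehicle type": "wagon", "year": 1998,
     "contents": "rear door lock stuck"},
    {"numbers": 3, "vehicle type": "sedan", "year": 1998,
     "contents": "brake noise"},
]

part_of_column = select(rows, "contents", 0, 2)  # as in Fig. 9
whole_column = select(rows, "contents")          # as in Fig. 10
numeric_column = select(rows, "year")            # not a letter row
```

The selected values are what would be passed on to the characteristics extractor 404.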

Next, the process of extraction performed by the
characteristics extractor 404 will be explained. An item value
is selected by the selector 403, and the characteristics of the
item name are extracted by the characteristics extractor 404.
Fig. 11 is a diagram explaining a list of contents of extraction
processing performed by the characteristics extractor 404 of
the document processor according to the first embodiment.

In Fig. 11, extraction includes extracting a word
contained in a letter row, the number of words, the number of
letters in the word, the number of appearances of that word,
etc. These are extracted using a natural language processing
technique such as format element (morphological) analysis or
syntax analysis, generally used in devices such as a regulatory
audio synthesizer device or an automated translation device.

Next, work processing performed by the work processor 405
will be explained. The work processor 405 processes the amount
of characteristics extracted by the characteristics extractor
404. Fig. 12 is a diagram explaining a list of contents of work
processing performed by the work processor 405 of the document
processor according to the first embodiment.

Work processing comprises processing such as
"classification" for classifying identical characteristics,
"retrieval" for retrieving a predetermined amount of
characteristics, "rearranging" for rearranging contents of the
characteristics amount, "representative extraction" for
extracting a representative value of an amount of
characteristics, "maximum value extraction" for extracting a
maximum value from an amount of characteristics, "minimum value
extraction" for extracting a minimum value from an amount of
characteristics, "calculation" for calculating an amount of
characteristics, and such like.
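
Several of the listed operations can be pictured as a table of functions applied to an extracted amount of characteristics. The following is only a sketch: the word counts are invented, the use of the median as the "representative value" and the sum as the "calculation" are assumptions, and "classification" is omitted because it is described separately.

```python
from collections import Counter
from statistics import median

# Hypothetical amount of characteristics: appearances of each word.
counts = Counter({"engine": 5, "door": 2, "brake": 3})

operations = {
    "retrieval": lambda c, word: c.get(word, 0),
    "rearranging": lambda c: sorted(c.items(), key=lambda kv: (-kv[1], kv[0])),
    "representative extraction": lambda c: median(c.values()),  # assumption
    "maximum value extraction": lambda c: max(c.values()),
    "minimum value extraction": lambda c: min(c.values()),
    "calculation": lambda c: sum(c.values()),                   # assumption
}

ranked = operations["rearranging"](counts)
largest = operations["maximum value extraction"](counts)
```

Holding the operations in one table makes it straightforward to pair any extracted characteristic with any kind of work processing.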

The operator can select his own combination of the
contents of characteristics extracted by the characteristics
extractor 404, and extracted characteristics processed by the
work processor 405. Furthermore, it is possible to preset
highly-efficient combinations, and supply these to the
operator.

The result of the processing carried out by the work
processor 405 is saved in a work-processing result saving
section 408 in the work processor 405. The processed result
saved in the work-processing result saving section 408 is output
from the output section 406. The output section 406 reads out
the contents of the work-processing result saving section 408,
and displays or prints them.

Here, an example will be explained in which the number
of appearances of a word contained in the item value is selected
as the (amount of) characteristics extracted by the
characteristics extractor 404, and classification is selected
as the work-processing to be carried out by the work processor
405.

In general, when there are two documents, and the words
making up the two documents appear with equal frequency, it can
be assumed that the meanings of the two documents are similar
to each other. That is, the number of appearances of a word in
a document is a characteristic closely related to the meaning
of the document. Therefore, it can be expected that when
multiple documents are classified using the number of
appearances of each word therein as a characteristic, documents
having similar meanings will be gathered into the same
classification category.

The analyzer 409 in the characteristics extractor 404
performs natural language analysis, such as format element
analysis, on each of the one or more item values selected by
the selector 403, and divides them into words. Furthermore,
information representing the part of speech of each word is
appended thereto. Of the words appearing, a unique word ID is
appended to those that are nouns, and the number of appearances
of each word ID is counted for each item value, and for all item
values selected by the selector 403.

The characteristics extractor 404 comprises a
characteristic vector creator 410, which creates an item value
characteristic vector showing the (amount of) characteristics
of individual item values based on the number of appearances
counted. For example, Fig. 13 shows the characteristic vectors
for each item value when the item values selected by the selector
403 are:

"Large noise pollution"
"Vehicle paint changes color"
"Overheat occurs"
"Paint is peeling"
"Battery is dead"
"Black exhaust fumes"

Furthermore, Fig. 14 shows the words and the number of
appearances of the word ID of each of those words.

Hence, the following characteristic vectors were
obtained:

"Large noise pollution" : {1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
"Vehicle paint changes color" : {0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0}
"Overheat occurs" : {0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0}
"Paint is peeling" : {0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0}
"Battery is dead" : {0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0}
"Black exhaust fumes" : {0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1}
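The counting and vector creation described above can be sketched as follows. This is a minimal illustration only: it assumes that the analyzer has already divided each item value into nouns, and the noun lists, the function name build_vectors, and the resulting word IDs are hypothetical and do not reproduce the word IDs of Fig. 14.

```python
from collections import Counter

def build_vectors(noun_lists):
    """Assign a word ID to each distinct noun and count appearances per item value."""
    word_ids = {}
    for nouns in noun_lists:
        for noun in nouns:
            # Word IDs are handed out in order of first appearance (an assumption).
            word_ids.setdefault(noun, len(word_ids))
    vectors = []
    for nouns in noun_lists:
        counts = Counter(word_ids[n] for n in nouns)
        # One vector element per word ID, holding the number of appearances.
        vectors.append([counts.get(i, 0) for i in range(len(word_ids))])
    return word_ids, vectors

# Hypothetical analyzer output: nouns only, one list per item value.
noun_lists = [["noise", "pollution"], ["vehicle", "paint", "color"], ["paint"]]
word_ids, vectors = build_vectors(noun_lists)
```

Each vector has one element per word ID, so item values that share nouns have non-zero elements in the same positions.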

The characteristic vectors of these item values are
output from the characteristics extractor 404 and sent to the
work processor 405. The work processor 405 classifies the
documents using the characteristic vectors of the item values.
Firstly, the distances between the individual vectors are
calculated. For example, the distances can be measured using
their inner products.
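As a minimal sketch of measuring closeness by inner products, the following uses three of the characteristic vectors listed above. The normalization into a cosine distance is an assumption added for illustration; the text states only that inner products can be used.

```python
import math

def inner_product(u, v):
    """Sum of element-wise products of two characteristic vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_distance(u, v):
    """1 - cosine similarity. The normalization is an assumption, not from the text."""
    norm = math.sqrt(inner_product(u, u) * inner_product(v, v))
    if norm == 0:
        return 1.0
    return 1.0 - inner_product(u, v) / norm

paint_color = [0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # "Vehicle paint changes color"
paint_peel = [0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0]   # "Paint is peeling"
noise = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]        # "Large noise pollution"
```

With these vectors, the two item values concerning paint have a larger inner product, and hence a smaller distance, than the pair of "Paint is peeling" and "Large noise pollution".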

After the distances have been calculated, the vectors
nearest to each other are gathered together. For example, a
K-means method is used to classify a group of vectors into K
vector groups in accordance with the distances between them.
When the vectors have been classified, the work processor 405
appends to each item value a number showing which
classification its vector belongs to, that is, a cluster
number, together with the document ID corresponding to the item
value, and sends the result to the output section 406, where
they are displayed.
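A K-means classification of this kind can be sketched as follows. This is an illustrative sketch rather than the implementation of the embodiment: it assumes squared Euclidean distance, seeds the centroids with the first K vectors so that the result is reproducible, and uses hypothetical two-dimensional vectors and document IDs.

```python
def kmeans(vectors, k, iterations=20):
    """Return a cluster number (0..k-1) for each vector."""
    # Assumption: the first k vectors seed the centroids; practical systems
    # usually choose the seeds at random.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Hypothetical document IDs and characteristic vectors.
doc_ids = [1, 2, 3, 4]
points = [[0, 0], [10, 10], [0, 1], [10, 11]]
result = list(zip(doc_ids, kmeans(points, k=2)))
```

Here result pairs each document ID with its cluster number, which corresponds to the data sent to the output section 406.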

Fig. 15 shows an example of a screen display of cluster
number 1501. Documents which have the same cluster number (for
example, documents "1" and "6", which both have the cluster
number "5") belong to the same classification group.

Next, an arrangement of a second aspect of the present 
invention in which a cross table is output will be explained. 
After the input section 401 has read out a cluster of documents 
to be analyzed, the operator inputs commands indicating the 
names of items to be classified, the names of items which will 
form the vertical or horizontal axis of the cross table, and 
the number of classifications. 

Fig. 16 shows a command screen for creating a cross table.
In Fig. 16, the command screen 1600 comprises a process item
name input column 1601, an axis item name column 1602, a vertical
axis command button 1603, a horizontal axis command button 1604,
and a classification number input column 1605.

The name of the item to be processed is input to the
process item name input column 1601. The item name can be input
from the keyboard 209 or the like, or by using the mouse 210
or the like to select an item from available items being
displayed. Furthermore, the name of the item to be used as an
axis is input to the axis item name column 1602. This can be
input by the same method as for the process item name input
column 1601.

The vertical axis command button 1603 and the horizontal
axis command button 1604 are for specifying whether the axis
item name is displayed on the vertical axis or the horizontal
axis. Furthermore, the number of classifications is input to
the classification number input column 1605. The number of
classifications can be input from the keyboard 209 or the like,
or by using the mouse 210 or the like to select a value from
available values being displayed.

In Fig. 16, "contents" is input to the process item name
input column 1601, "vehicle type" is input to the axis item name
column 1602, the horizontal axis command button 1604 is marked,
and "50" is input to the classification number input column 1605.
This indicates that a command has been given to classify the
document cluster into "50" classifications based on the
"contents" of the document cluster, and to display "vehicle
type" along the horizontal axis of the cross table.

Following a command to create the cross table,
classification is carried out, and the classification result
is displayed in the cross table. Figs. 17 and 18 are diagrams
showing cross tables displaying classification results. In
the cross table 1700 of Fig. 17, the vertical axis displays
"cluster 1", "cluster 2", ..., showing classifications, and the
horizontal axis displays "ABC1600", "ABC1800", ..., showing
vehicle types.

The vertical axis of the table, that is, the lines,
corresponds to the clusters created by classification. The
first column of each line contains a letter row showing the
cluster number assigned by default at the end of
classification. The horizontal axis of the table, that is, the
columns, displays the non-duplicating letter rows contained in
the item "vehicle type" of the document cluster. Each cell of
the line "cluster 1" displays the number of documents
classified into cluster 1 in which the value of the item
"vehicle type" matches the vehicle type in that column.

Here, instead of displaying numbers, the magnitude of
each number may be expressed by the color intensity of the
cell, or by the area of the cell that is painted. Furthermore,
the column on the far right and the line at the bottom of the
table show the totals of the lines and columns.

In Fig. 18, when a mouse pointer 1800 is moved to a cell of
the cross table 1700 and the mouse button of the mouse 210 is
pressed, or when the cursor is moved by operating a cursor key
on the keyboard 209 and a specific key is pressed, a content
display screen 1801 is displayed near that cell, showing the
item "contents" of the corresponding documents.

The content display screen 1801 displays the number of 
data in the cell, the display items, cell information, and 
contents of the display items in the data. The cell specified 
by the mouse pointer 1800 displays a data number: "4", display 
item: "contents", cell information: "ABC2000-cluster 1", and 
four contents as "contents" of the display items: "exhaust is 
black, exhaust is black,...". Consequently, the contents of 
a cell can be identified simply by moving the mouse pointer to 
the desired cell and pressing the mouse button. 

Furthermore, the items displayed in the content display 
screen 1801 can be updated by resetting, all the items can be 
displayed, and items can be selectively displayed. 

The first column of each line contains letter rows showing 
values determined at the end of classification as preset cluster 
numbers. This column can be rewritten by the operator. For 
example, after confirming the contents of a cell by the 
operation described above, "cluster 1" can be rewritten as
"exhaust problems." As a consequence, it is easier to grasp
the content of the information.

Furthermore, instead of inserting a letter row showing 
the value determined at the end of classification as a preset 
cluster number, it is possible to extract a letter row showing 
the characteristics of the cluster, and insert this into the 
cell. For example, this can be achieved by extracting the 
phrases and words which appear most frequently from the item 
"contents" of the document contained in cluster 1. 

In Fig. 18, words such as "exhaust is black" or "exhaust"
are entered for cluster 1. Thus, by a simple operation,
the operator is able to learn not only the distribution of the
entire document cluster, but also, where necessary, the detailed
contents of individual documents.
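Extracting a letter row that characterizes a cluster from the most frequent words, as described above, can be sketched as follows. The sketch assumes that words are separated by whitespace, which stands in for the language analysis of the embodiment, and the function name cluster_label and the sample contents are hypothetical.

```python
from collections import Counter

def cluster_label(contents, top_n=2):
    """Join the most frequent words in a cluster's "contents" values into a label."""
    # Assumption: whitespace splitting replaces the analyzer's word division.
    counts = Counter(word for text in contents for word in text.lower().split())
    return " ".join(word for word, _ in counts.most_common(top_n))

# Hypothetical "contents" values of the documents in one cluster.
label = cluster_label(["exhaust is black", "exhaust smells", "black exhaust"])
```

In practice, frequent function words such as "is" would need to be excluded, for example by restricting the count to nouns as the analyzer 409 does.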

Next, the constitution of the output section 406 for
creating a cross table will be explained in detail. Fig. 19
is a block diagram showing a constitution of the output section
406 of the document processor according to the first embodiment.
The output section 406 comprises an item value selector 1901,
and a totalizer 1902, in addition to the graph-drawing section
407. Moreover, the totalizer 1902 comprises a table saving
section 1903 having a memory region in correspondence with
contents which are actually displayed.

In compliance with an item name (axial item name)
specified by the operator as one axis of the cross table, the
item value selector 1901 sequentially reads out item values from
document data stored in the document memory 402, and gathers
item values which are not duplicated. Furthermore, the
totalizer 1902 totalizes the documents by adding a numerical
value to the region of the table saving section 1903
corresponding to each item value.

Next, the output sequence of a cross table will be 
explained. Fig. 20 is a flowchart showing an output sequence 
of a cross table of the document processor according to the first 
embodiment. In the flowchart of Fig. 20, the contents of the 
table saving section 1903 are initialized (Step S2001) prior 
to totalization. 

Next, an item value produced by the item value selector
1901 is allocated to a portion of the table corresponding to
the item value label (Step S2002), and a letter row expressing
a cluster number is allocated to a portion corresponding to the
cluster number (Step S2003).

Next, the value of the axial item is determined for each
item value saved in the work-processing result saving section
408 by referring to the document stored in the document memory
402 that has the corresponding document ID (Step S2004).
Thereafter, 1 is added to the contents of the corresponding
region in the table saving section 1903 (Step S2005).

It is then determined whether all the item values have
been processed (Step S2006), and if not (NO in the Step S2006),
the sequence shifts back to the Step S2004, and the processes
from the Step S2004 to the Step S2006 are repeated.

When it has been determined in the Step S2006 that
processing has been carried out for all the item values (YES in
the Step S2006), the total of each line is calculated to be
displayed in the far right column (Step S2007), and
simultaneously, the total of each column is calculated to be
displayed in the bottom line (Step S2008).

Thereafter, the table formed in the table saving section
1903 is sequentially read out (Step S2009), whereby all
processing ends.
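The output sequence of Fig. 20 can be sketched as follows. This is a minimal sketch under stated assumptions: the table saving section 1903 is modeled as a nested dictionary, and the function name build_cross_table, the two input mappings, and the sample values are hypothetical.

```python
def build_cross_table(cluster_of, axis_value_of):
    """cluster_of: document ID -> cluster number.
    axis_value_of: document ID -> value of the axial item (e.g. vehicle type)."""
    # S2001-S2003: initialize the table with one line per cluster and
    # one column per non-duplicated axial item value.
    columns = sorted(set(axis_value_of.values()))
    lines = sorted(set(cluster_of.values()))
    table = {line: {col: 0 for col in columns} for line in lines}
    # S2004-S2006: look up each document's axial item value and add 1 to its cell.
    for doc_id, cluster in cluster_of.items():
        table[cluster][axis_value_of[doc_id]] += 1
    # S2007-S2008: totals of each line and of each column.
    line_totals = {line: sum(table[line].values()) for line in lines}
    column_totals = {col: sum(table[line][col] for line in lines) for col in columns}
    return table, line_totals, column_totals

# Hypothetical classification result and vehicle types for three documents.
cluster_of = {1: 1, 2: 1, 3: 2}
axis_value_of = {1: "ABC1600", 2: "ABC1800", 3: "ABC1600"}
table, line_totals, column_totals = build_cross_table(cluster_of, axis_value_of)
```

Reading out the returned table line by line corresponds to the Step S2009.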

Data output from the work processor 405 can be sent to
the document memory 402, and stored there with other data in
the document memory 402. Data which have been output from the
work processor 405 and stored in the document memory 402 can
be displayed by the output section 406 as a new row of the table.
Furthermore, existing rows of the table can be deleted, and
replaced by writing the new data.

In this constitution, the result of the processing, being
the data output from the work processor 405, can be handled on
an equal footing with other data in the document memory 402
which was not processed this time. In subsequent analysis, the
data can be selected for work processing without needing to
distinguish whether it was present in the original input data,
or was created by the work processor 405 during analysis.

Therefore, the data to be work processed and the contents
of the work processing can be flexibly selected in accordance
with the type of data, and the contents of the information
analysis to be performed, enabling a wide variety of information
to be analyzed with high precision.

Furthermore, it is possible to input to the work processor
405 not only data output from the characteristics extractor 404,
but also data selected by the selector 403. Consequently,
additional work processing can be carried out on data whose
characteristics do not need to be extracted from the letter row,
and on numerical values of the work processed result, enabling
an even wider variety of information to be analyzed with high
precision.

Figs. 21 to 24 are diagrams explaining other examples of 
display screens of the output section 406 of the document 
processor according to the first embodiment. In Fig. 21, a 
"cluster number" 2101 obtained by classification is displayed 
in addition to "number", "date received", "sales office", 
"vehicle type", "year", and "contents". 

Moreover, in Fig. 21, the selector 403 has selected 
"cluster number" 2101, and data relating to the "cluster number" 
2101 is displayed in inverse video. When the "cluster number" 
2101 is indicated using a key, the work processor 405 rearranges 
the data. 

Fig. 22 shows the result of the rearrangement. In Fig.
22, documents having a "cluster number" of "1" have been
collected and displayed. Thereafter, documents having a
"cluster number" of "2" are collected and displayed.

More specifically, the documents are rearranged in a
sequence of "numbers" "2", "11", "15", "23", "35", "54", "63",
"73", and "82", which have a "cluster number" of "1".
Thereafter, "numbers" "14", "18", "22", "27", "37", which
have a "cluster number" of "2", are displayed.
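The rearrangement using the "cluster number" as a key can be sketched as a sort. The record layout below is hypothetical, and only four of the document numbers mentioned above are used.

```python
# Hypothetical records: one dictionary per document, in input order.
records = [
    {"number": 14, "cluster number": 2},
    {"number": 2, "cluster number": 1},
    {"number": 18, "cluster number": 2},
    {"number": 11, "cluster number": 1},
]

# Sort by cluster number first, then by document number within each cluster.
rearranged = sorted(records, key=lambda r: (r["cluster number"], r["number"]))
order = [r["number"] for r in rearranged]
```

After sorting, the documents of each cluster occupy a continuous range of the list, which is what allows them to be selected as a continuous region on the screen.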

Next, the items in the "vehicle type" column of the
documents whose "cluster number" is "1" are selected. In
Fig. 23, the items in the "vehicle type" column which belong to
"cluster number" of "1" have been selected, and the selected
region 2301 is displayed in inverse video. In this way, since
the documents have already been rearranged according to their
"cluster number", and documents belonging to the same cluster
have been gathered and displayed, they can easily be selected
as a continuous region on the screen.

Next, Fig. 24 shows a bar graph of the frequency of
occurrence of each vehicle type in the selected region
2301. In Fig. 24, the bar graph display region 2401 displays
the nine documents whose "cluster number" is "1",
selected in the selected region 2301. These nine documents
are displayed in the bar graph according to their vehicle type.
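The totals behind such a bar graph can be sketched as a simple count per vehicle type. The function name vehicle_type_bars, the text rendering with "#" marks, and the sample values are assumptions for illustration; the graph-drawing section 407 of the embodiment draws an actual graph.

```python
from collections import Counter

def vehicle_type_bars(selected_vehicle_types):
    """Return one text bar per vehicle type found in the selected region."""
    counts = Counter(selected_vehicle_types)
    return [f"{vehicle} {'#' * n} {n}" for vehicle, n in sorted(counts.items())]

# Hypothetical "vehicle type" values taken from a selected region.
bars = vehicle_type_bars(["ABC1600", "ABC2000", "ABC1600"])
```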

In this way, the documents to be work processed can be
flexibly and easily selected, and various kinds of processes
can be carried out thereto. Furthermore, the processed result
can be processed again in the next processing, enabling
information to be analyzed at high precision.

Here, the characteristics of the letter rows to be
classified or the like are extracted, and various kinds of
processing are performed after the work processing that uses
those characteristics. However, such processing may
alternatively be performed in advance.

For example, it is possible to select the item "vehicle
type", rearrange the documents using this as a key, and classify
only the documents collected for a given vehicle type, for
example, "ABC1600". Furthermore, when a document input by the
input section 401 contains errors such as misspellings, it is
possible to retrieve the letter row and replace the errors prior
to extracting the characteristics of the letter row to be
classified and carrying out work processing using these
characteristics, thereby adjusting the data to obtain a more
accurate result.

Fig. 25 is a block diagram showing a detailed constitution
of the document memory 402 of the document processor according
to the first embodiment. In Fig. 25, the document memory 402
comprises a set value memory 2501, and a set value transceiver
2502. The set value memory 2501 comprises memories, starting
with a classification data memory 2503, for storing information
relating to various set values, that is, set values needed for
operations of the document processor. Consequently, the
information relating to the set values can be stored together
with the document information.

Furthermore, the set value transceiver 2502 transmits
information relating to the set values stored in the set value
memory 2501 to other information processors. Furthermore, the
set value transceiver 2502 receives the information relating
to the set values from other information processors.
Information relating to set values is received by the set value
transceiver 2502, and is stored in the set value memory 2501.

Stored information relating to set values is read out
simultaneously with the next reading of the document,
and is stored in the set value memory 2501. The operator can
refer to the information relating to the set values by a
predetermined operation, and it can be reused in subsequent
processing. Consequently, the information relating to set
values can be saved and managed together with the documents,
thereby preventing loss of the information relating to the set
value, and enabling appropriate set values to be reused later.
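Saving the set values together with the documents, so that both are read back at the same time, can be sketched as follows. The JSON file layout, the function names, and the key names are assumptions made for illustration; the embodiment does not specify a storage format.

```python
import json

def save_with_set_values(path, documents, set_values):
    """Store the documents and the set values used to analyze them in one file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"documents": documents, "set_values": set_values}, f)

def load_with_set_values(path):
    """Read the documents and their set values back together."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["documents"], data["set_values"]
```

Because the set values travel in the same file as the documents, reopening the file restores, for example, the number of classifications that was used previously.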

Figs. 26 to 28 are diagrams explaining other examples of
screen displays of the output section 406 of the document
processor according to the first embodiment. In Fig. 26,
firstly, the operator selects the "contents" to be classified
on the display screen. Consequently, the selected region 2601
is displayed in inverse video. Next, when the classification
button is selected from a menu bar 2603, a question screen
2604 appears asking the number of classifications required,
that is, the number of documents to be classified.

When the operator enters the number of classifications
into the question screen 2604, information relating to the
number of classifications is stored in the document memory 402.
In Fig. 26, "50" is input as the number of classifications. 

Thereafter, when the operator completes the analysis of
the information, and presses a save button (not shown in the
diagram) which pops up on the screen after selecting the file
button 2605 of the menu bar 2603, the document memory 402 stores
the information of the document together with the
classification result after appending a file name specified by
the operator.

In Fig. 27, when the mouse pointer 2702 is moved to a column
2701 displaying the classification result, and the mouse button
is pressed, a classification information display screen 2703
displays the information relating to classification used in the
classification, and information relating to the classification
set value. As a consequence, relevance of the set value used
can be easily understood.

The information relating to classification, displayed on
the classification information display screen 2703, for example
comprises "classification date" showing information relating
to the time and date on which classification was carried out,
"number of documents" showing information relating to the
number of documents that were classified, etc. Furthermore,
the information relating to the classification set value
comprises information such as "classification number" showing
the number of classified documents, and "classification speech
part" showing which part of speech the classification was based
on.

A new table is created for each classification. Fig. 28
shows a second classification result displayed after
classification has been carried out a second time after
obtaining the first classification result. When the operator
wishes to display the first classification result again, he or
she moves the mouse pointer to the selection region 2801 on the
label at the bottom left of the screen, and presses the mouse
button. As a consequence, the first classification result is
displayed again. Thereafter, the second classification result
can be displayed again by performing the same operation.

Furthermore, in Fig. 28, information relating to the set
value used in the classifications is displayed in a
predetermined display region 2802 of the table. The display
region 2802 does not conceal the classification result display,
and the position of the display can be moved. Consequently,
the relationship between the classification result and the set
value can easily be understood.

Next, a sequence of document processing of the document
processor according to the first embodiment will be explained.
Fig. 29 is a flowchart showing a document processing sequence
of the document processor according to the first embodiment.

In the flowchart of Fig. 29, when starting the process,
it is determined whether the document data has been input to
the document processor (Step S2901). Here, the document
processor waits for the document data to be input, and when the
document data has been input (YES in Step S2901), the input
document data is stored (Step S2902). The Steps S2901 and S2902
may be carried out independently of other steps each time
document data is input.

Next, it is determined whether all or part of the stored
document data has been selected (Step S2903). Here, the
document processor waits for all or part of the document data
to be selected, and when document data has been selected (YES
in Step S2903), data relating to letter row characteristics of
all or part of the stored document data is extracted (Step
S2904).

Thereafter, predetermined work processing, such as
classification, is carried out based on the data relating to
the letter row characteristics extracted in the Step S2904
(Step S2905). Following this, data which were work-processed
in the Step S2905 are output in a table format or the like
(Step S2906).

Moreover, the data which were work-processed in the Step
S2905 are stored in correspondence with the original document
data (Step S2907). Furthermore, data relating to contents of
the work processing such as the set value of the work processing
are simultaneously stored (Step S2908).

Thereafter, it is determined whether all or part of the
data processed in the Step S2905 has been selected (Step S2909).
When the data has been selected (YES in the Step S2909), the
sequence shifts to the Step S2904, and thereafter, the processes
from the Step S2904 to S2909 are repeated. On the other hand,
when it is determined that all or part of the data processed
in the Step S2905 has not been selected (NO in the Step S2909),
all processing ends.

The document processing explained in the first embodiment
can be realized using a program prepared in advance on a computer,
such as a personal computer or a work station. This program
is recorded on a computer-readable recording medium such as a
hard disk, a floppy disk, a CD-ROM, an MO, or a DVD, and is
executed by reading out the program from the recording medium
using the computer. Furthermore, the program can be
distributed via the recording medium, or by using a network such
as the Internet as a transmission medium.

Next, an information classification device according to
second to sixth embodiments will be explained. In the second
to sixth embodiments described below, multiple classifications
are carried out while varying parameters (number of clusters
and document clusters to be classified, standards of similarity,
stop words, etc.) for document classification, extraction, and
positioning of a topic (content) from one cluster of documents,
based on the same interpretation as above, namely that a
document cluster includes a great amount of noise. By providing
means for saving and integrating the results, it is possible
to gradually determine what kind of contents are contained in
a given document cluster.

Since the information processing system comprising the 
document classification device according to the second 
embodiment of the present invention is the same as the first 
embodiment shown in Fig. 1, further explanation will be omitted. 
Furthermore, since the hardware constitution of the server 101
and the clients 102 is the same as the first embodiment shown
in Figs. 2 and 3, in order to avoid repetition, their explanation
will be omitted.

Next, the functional constitution of a document 
classification device according to the second embodiment will 
be explained. Fig. 30 is a block diagram showing a functional 
constitution of the document classification device according 
to the second embodiment. 

As shown in the block diagram of Fig. 30, the document
classification device comprises an input section 3001, a
language analyzer 3002, a vector creator 3003, a classifier 3004,
a classification parameter specifier 3005, a classification
result memory 3006, a cluster characteristics display 3007, a
cluster characteristics calculator 3008, a classification
category memory 3009, a cluster selection specifier 3010, and
a classification category viewing operator 3011.

The input section 3001, the language analyzer 3002, the
vector creator 3003, the classifier 3004, the classification
parameter specifier 3005, the classification result memory 3006,
the cluster characteristics display 3007, the cluster
characteristics calculator 3008, the classification category
memory 3009, the cluster selection specifier 3010, and the
classification category viewing operator 3011 are controlled
by command processing of a CPU 201, a CPU 301, and the like,
in compliance with commands written in programs recorded in
recording media such as a ROM 202, a ROM 302, a RAM 203, a RAM
303, or a disk device 306, and a hard disk 316.

Here, the input section 3001 inputs document data, and
for example comprises an I/F 204, or an I/F 309, or the like,
capable of obtaining documents and groups of documents via
keyboards 209 or 311, a scanner 313 comprising an OCR function,
and a network 103.

Furthermore, in addition to the above, the input
section 3001 may comprise any part capable of extracting
document data. For example, when the document data is saved in
a database, and the medium in which the database is stored is
provided in the document processor of the first embodiment, the
document data is input from that database.

Furthermore, the language analyzer 3002 obtains
language-analyzed information by analyzing document data input
by the input section 3001. The vector creator 3003 creates a
document characteristics vector for the document data, based
on the language-analyzed information obtained from the language
analyzer 3002.

Furthermore, the classifier 3004 classifies documents 
based on the degree of similarity between document 
characteristic vectors created by the vector creator 3003, and 
creates clusters of documents. The classification parameter 
specifier 3005 specifies classification parameters, and for 
example comprises the I/F 204 or 309, or the like, capable of 
obtaining documents and groups of documents via the keyboards 
209 or 311, the mouses 210 or 312, or the network 103. 

Furthermore, the classification result memory 3006 
stores the classification result obtained by the classifier 
3004, that is, information relating to clusters of classified 
documents. Furthermore, the cluster characteristics display 
3007 displays cluster characteristics calculated by the cluster 
characteristics calculator 3008. 

The cluster characteristics calculator 3008 calculates 
cluster characteristics, which are characteristics of document 
clusters created by the classifier 3004. Furthermore, the 
classification category memory 3009 stores the cluster 
characteristics, calculated by the cluster characteristics 
calculator 3008, as constitution elements of classification
categories. Furthermore, the classification category memory
3009 stores clusters of documents, selected by the cluster 
selection specifier 3010, as constitution elements of 
classification categories. That is, it stores all or some of 
the documents belonging to clusters selected by the cluster 
selection specifier 3010 as constitution elements of 
classification categories. 

The cluster selection specifier 3010 selects desired 
clusters from among the multiple cluster characteristics 
displayed by the cluster characteristics display 3007. 
Furthermore, the cluster selection specifier 3010 selects
desired clusters of documents from among the clusters of
documents created by the classifier 3004. Furthermore, the
classification category viewing operator 3011 controls viewing 
of data stored in the classification category memory 3009. 

Next, there will be explained an appropriate example in 
which it is important to extract a topic (contents) contained 
in a document cluster, by imagining an analysis of free 
responses collected through a questionnaire or the like. 

In recent years, it has become possible to collect
thousands to tens of thousands of free responses in a short
period of time via the Internet or the like. Using this function,
a large amount of textual information can be gathered.

As an example of a large amount of textual information
collected through a questionnaire or the like, consider
documents containing written answers given in response to the
question: "Please give an example of wasteful office
networking". Here, the document cluster is the cluster of
individual responses.

Here, the operator (the questionnaire analyzer) may want
to know a summary of the opinions expressed, that is, what type
of opinions (topics) are contained in the cluster of opinions
(document cluster). To fulfil this requirement, topics are
extracted by gathering together (classifying) similar opinions,
so as to extract information relating to the kind of opinions
that are contained in the result of the questionnaire.

Document classification typically comprises the
following three clearly divided steps. In the first Step, the
language analyzer 3002 extracts words (or specific continuous
rows of letters) contained in each of the documents (opinions)
input by the input section 3001. At this time, for example,
a language analysis algorithm such as format element analysis
is used.

In the second Step, a "word" x "document" matrix is
created using the extracted words as rows, the documents as
lines, and the word incidence as components. In addition to
word extraction, language analysis tools having a morphological
analysis function and a syntax analysis function can
simultaneously provide other information, such as
part-of-speech information, phrases, and syntax information,
which can be considered when creating the above "word" x
"document" matrix.
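
By way of illustration only, the first and second Steps can be
sketched in Python as follows. The function name
build_word_document_matrix and the whitespace tokenization are
assumptions made for this sketch; they stand in for the language
analyzer 3002 and are not part of the embodiment. The sketch
stores one list per word, with one entry per document.

```python
from collections import Counter


def build_word_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts word i in document j."""
    # Assumption: lower-casing and whitespace splitting replace real
    # language analysis (e.g. morphological analysis).
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # One list per word; each entry is the incidence in one document.
    matrix = [[count[word] for count in counts] for word in vocabulary]
    return vocabulary, matrix


responses = ["network cost high", "network slow", "cost cost high"]
vocabulary, matrix = build_word_document_matrix(responses)
```

A real system would replace the tokenizer with a language
analysis tool and could add part-of-speech or phrase
information as further components.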



Based on the "word" x "document" matrix, the vector 
creator 3003 expresses the documents as vectors in 
multidimensional space comprising words. This is accomplished
by one of the following methods, all of which are implemented 
in the embodiments of the present invention. 

(1) use the row elements of the matrix directly;

(2) append values representing the importance of the
documents after considering the length of the documents (number
of letters, number of pages, etc.) and the incidence of the words
in all the classified clusters;

(3) calculate an inner product matrix between documents
from the above matrix, and apply eigenvalue analysis (for
example, by using factor analysis, principal component analysis,
quantification method type III, and the like), to form a latent
semantic space.
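
As a hedged illustration of method (2), the following Python
sketch weights each word incidence by the document length and
by how widely the word occurs across all documents. The
specific formula (a smoothed TF-IDF weighting) and the name
weight_document_vectors are assumptions chosen for this sketch;
the embodiment does not prescribe a particular weighting.

```python
import math


def weight_document_vectors(matrix):
    """matrix[i][j] = incidence of word i in document j.

    Returns one weighted vector per document (assumed TF-IDF style).
    """
    n_words = len(matrix)
    n_docs = len(matrix[0]) if matrix else 0
    # Number of documents in which each word occurs at least once.
    doc_freq = [sum(1 for count in row if count > 0) for row in matrix]
    vectors = []
    for j in range(n_docs):
        # Document length measured in words; guard against empty documents.
        length = sum(matrix[i][j] for i in range(n_words)) or 1
        vector = []
        for i in range(n_words):
            idf = math.log((1 + n_docs) / (1 + doc_freq[i])) + 1.0
            vector.append(matrix[i][j] / length * idf)
        vectors.append(vector)
    return vectors
```

Words that occur in few documents receive a larger weight, and
long documents do not dominate merely because of their length.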

Furthermore, it is also possible to use the method
described in "Representing documents using an explicit model
of their similarities" (Authors: Brian T. Bartell, Garrison W.
Cottrell, and Richard K. Belew; Journal: Journal of the
American Society for Information Science; Academic Body: The
American Society for Information Science; Pages: 254-271, Vol.
46 No. 4; Year of Publication: 1995), wherein the method for
converting to latent semantic space is generalized, and
joint reference information and the like, created from the
references that a document makes to other documents, is
appended to the inner product matrix between documents, and this
matrix is used to derive expression space conversion
coefficients for projecting documents and words to a space
reflecting their similarities.

In the third Step, the classifier 3004 classifies the
documents using the degree of similarity of the document
characteristic vectors. More specifically, the documents are
classified by a method such as square contingency,
discriminant analysis, or clustering.

Furthermore, the degree of similarity may be measured by 
the inner product, the cosine, the Euclidean distance, the 
Mahalanobis distance, or the like. Any of these methods can 
be used in the present embodiment. 
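
For illustration, three of the similarity measures named above
can be written in Python as follows. The function names are
chosen for this sketch only. The Mahalanobis distance is
omitted because it additionally requires the inverse of a
covariance matrix estimated from the document vectors.

```python
import math


def inner_product(u, v):
    """Sum of component-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Inner product normalized by both vector lengths (0.0 for a zero vector)."""
    norm = math.sqrt(inner_product(u, u)) * math.sqrt(inner_product(v, v))
    return inner_product(u, v) / norm if norm else 0.0


def euclidean_distance(u, v):
    """Straight-line distance; smaller values mean more similar documents."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Note that the inner product and the cosine grow with
similarity, whereas the Euclidean distance shrinks, so a
classifier must treat the two kinds of measure accordingly.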

Furthermore, there are many conventionally known
clustering algorithms. Clustering is generally divided into
hierarchical (layered) clustering and non-hierarchical
(non-layered) clustering, but either can be used in the present
embodiment.
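
As one example of a non-hierarchical clustering algorithm, a
minimal k-means sketch in Python is given below. The choice of
k-means, the seeding with the first k vectors, and the fixed
iteration count are assumptions made for this sketch; the
embodiment does not name a specific algorithm. Here the number
of clusters k plays the role of a classification parameter.

```python
def kmeans(vectors, k, iterations=20):
    """Return (labels, centers); the first k vectors seed the centers."""
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to the nearest center (squared Euclidean).
        for n, v in enumerate(vectors):
            distances = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centers]
            labels[n] = distances.index(min(distances))
        # Move each center to the mean of its members; an empty cluster
        # keeps its previous center.
        for c in range(len(centers)):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```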

Furthermore, the classification parameter specifier 3005
specifies classification parameters to enable the classifier
3004 to classify the document characteristic vectors. The
classifier 3004 classifies the document characteristic vectors
that it holds, in compliance with the classification parameters
specified by the classification parameter specifier 3005.

Thus, when the first document classification, comprising
the processes of the first to third Steps, has ended, the
classification result memory 3006 stores the classification
result.

Following this, the cluster characteristics calculator
3008 calculates characteristics showing what kind of clusters
have been obtained in the classification result, that is, it
calculates cluster characteristics. Typically, it determines
the documents, or some of the documents, belonging to each
cluster, and sorts the documents based on their degree of
similarity with the center of the cluster.

In addition, values such as the words with the highest
incidence, the number of documents belonging to the cluster,
and the standard deviation within the cluster, which shows the
level of variation of documents within the cluster, are
calculated to represent cluster characteristics.
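
The following Python sketch shows, under stated assumptions,
how some of these cluster characteristics might be computed.
The function name cluster_characteristics, the dictionary keys,
the use of the mean vector as the cluster center, and the use
of the cosine as the degree of similarity are all assumptions
of this sketch.

```python
import math


def cluster_characteristics(doc_vectors, labels, vocabulary, cluster_id, top_n=3):
    """Summarize one cluster: size, frequent words, members sorted by similarity."""
    members = [n for n, label in enumerate(labels) if label == cluster_id]
    if not members:
        return None
    dims = range(len(vocabulary))
    # Assumed definition of the cluster center: mean of the member vectors.
    center = [sum(doc_vectors[n][d] for n in members) / len(members) for d in dims]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    # Members ordered from most to least similar to the center.
    ranked = sorted(members, key=lambda n: cosine(doc_vectors[n], center), reverse=True)
    # Words ordered by total incidence within the cluster.
    totals = [sum(doc_vectors[n][d] for n in members) for d in dims]
    top = sorted(dims, key=lambda d: -totals[d])[:top_n]
    return {
        "cluster_id": cluster_id,
        "number_of_members": len(members),
        "words_of_high_incidence": [vocabulary[d] for d in top],
        "documents_by_similarity": ranked,
    }
```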

The cluster information is calculated in order to inform 
the operator what kinds of clusters (i.e. what kind of 
characteristics they possess) have been output (displayed), and 
as long as the cluster information shows cluster 
characteristics to the operator, other types of contents 
(characteristics) than the above may be used. 

Furthermore, in addition to the cluster characteristics
described above, the cluster characteristics calculator 3008
also calculates information representing the relationship
between clusters. In the case of hierarchical (layered)
clustering, the upper or lower cluster is calculated, and in
the case of non-hierarchical (non-layered) clustering, adjacent
clusters are calculated based on their degree of similarity to
the cluster center.

Next, the display of cluster characteristics by the cluster
characteristics display 3007, and cluster selection, will be
explained. Fig. 31 is a diagram explaining an example of a
display of the cluster characteristics display 3007 of the
document classification device according to the second
embodiment.

In Fig. 31, each cluster comprises items such as a 
"cluster ID" column 3101, a "number of members" column 3102, 
a "words of high incidence" column 3103, a "document contents" 
column 3104, and a "degree of similarity to center" column 3105, 
thereby enabling the operator to operate the display in units. 

The "cluster ID" column 3101 displays serial numbers 
showing the cluster IDs. The "number of members" column 3102
displays the calculated number of documents, or some of the 
documents, belonging to the cluster. The words having the 
highest incidence in these documents are extracted and 
displayed in the "words of high incidence" column 3103. The 
contents of the documents are displayed in the "document 
contents" column 3104, and the degree of similarity to the 
center is expressed in numerical form and displayed in the 
"degree of similarity to center" column 3105. This makes it 
easier for the operator to understand the information. 



The operator can detect the characteristics of the 
clusters based on the information (amount of characteristics) 
displayed. Here, when there is one cluster whose contents 
(characteristics) can be understood, it can be selected by the 
cluster selection specifier 3010. 

More specifically, by moving the cursor 3110 to a 
predetermined position of the displayed cluster, for example 
to the "cluster ID" column 3101 using the mouse 210 or 312 or 
the like, and clicking on that position, the entire cluster of 
that cluster ID can be selected. It is acceptable to select 
some, rather than all, of the documents belonging to the 
selected cluster. 

In Fig. 31, the "cluster ID" column 3101 has been clicked, 
whereby the entire cluster is displayed in inverse video, and 
the cluster (cluster ID "1") is selected. 

Furthermore, when there is no cluster with comprehensible 
contents, the operator resets the classification parameters 
using the classification parameter specifier 3005, and executes 
another classification. 

Data relating to the cluster ID selected by the cluster 
selection specifier 3010 is transmitted to the classification 
category memory 3009. The classification category memory 3009
retrieves and stores the above amount of characteristics from 
the cluster characteristics calculator 3008, based on the data 
relating to the cluster ID. 



Similarly, the classification category memory 3009
retrieves and stores the classification result from the
classification result memory 3006. Moreover, the
classification category memory 3009 can simultaneously store
information representing comments (e.g. "network maintenance
cost is high") about clusters input by the operator. Storing
information created by the operator as constituent elements of
the classification category in this way increases the
utilizable value of the classification category.

When an interface for other viewing operations is 
provided, data stored in the classification category memory 
3009 can be structured and categorized manually, or 
automatically by using the degree of similarity of the stored 
clusters to the cluster center, while viewing contents of 
selected and stored clusters, and pinpointing meaningful 
connections therebetween. 

Next, a processing sequence of the document 
classification device according to the second embodiment will 
be explained. Fig. 32 is a flowchart showing a processing 
sequence of the document classification device according to the 
second embodiment. In the flowchart of Fig. 32, firstly, the 
document to be classified is input (Step S3201).

Next, the language of the input document is analyzed (Step
S3202), and a document characteristic vector is created based on
the result of the analysis, that is, based on the extracted words
(Step S3203).

Thereafter, the process waits for a classification
parameter to be specified, and when a classification parameter
has been specified (YES in Step S3204), the document is
classified in compliance with the specified classification
parameter (Step S3205), and the result, that is, information
relating to the clusters, is stored (Step S3206).

Next, the characteristics of the classified clusters are
calculated (Step S3207), and the calculated results are
displayed (Step S3208). It is determined whether any of the
displayed clusters has been selected (Step S3209), and if not
(NO in the Step S3209), processing shifts to the Step S3204 and
waits once more for a classification parameter to be specified
(Step S3204).

On the other hand, when it is determined in the Step S3209
that a cluster has been selected (YES in the Step S3209), a
classification category for the selected cluster is created and
stored (Step S3210). At this time, information relating to
clusters input by the operator can also be stored. Here, the
processing series ends.

As described above, according to the document
classification device of the second embodiment, an expression
space conversion coefficient, for converting the documents to
expression space capable of projecting the meaningful
connections between the documents, is calculated based on the
degree of similarity between documents in document clusters to
be classified, and the documents are classified in the
expression space. Consequently, the documents can be
classified in a manner that reflects the intentions of the
operator.

Therefore, clusters can be obtained from the classifier
3004, and in addition, the clusters can be structured and
categorized based on their contents by the cluster
characteristics calculator 3008 and the classification
category memory 3009, using the degree of similarity of the
clusters to the cluster center and the like.

Furthermore, it is possible to structure and categorize 
clusters closer to the intentions of the operator by using only 
the clusters selected by the cluster selection specifier 3010. 

In addition to the second embodiment described above, a
vector memory and a vector corrector may be added to the
constitution as in the third embodiment described below.

Since the information processing system comprising the
document classification device according to the third
embodiment of the present invention is the same as the first
embodiment shown in Fig. 1, further explanation will be omitted.
Furthermore, since the hardware constitutions of the server 101
and the clients 102 are the same as the first embodiment shown
in Figs. 2 and 3, explanation thereof will be omitted.

Next, the functional constitution of a document
classification device according to the third embodiment will
be explained. Fig. 33 is a block diagram showing a functional
constitution of the document classification device according
to the third embodiment. In Fig. 33, like members to those in
Fig. 30 of the second embodiment are represented by like
reference symbols, and explanation thereof is omitted.

In the block diagram of Fig. 33, the document 
classification device comprises an input section 3001, a 
language analyzer 3002, a vector creator 3003, a classifier 3004,
a classification parameter specifier 3005, a classification 
result memory 3006, a cluster characteristics display 3007, a 
cluster characteristics calculator 3008, a classification 
category memory 3009, a cluster selection specifier 3010, a 
classification category viewing operator 3011, a vector memory 
3301, and a vector corrector 3302. 

The vector memory 3301 stores document characteristic
vectors created by the vector creator 3003. Furthermore, the
vector corrector 3302 corrects the document characteristic
vectors stored in the vector memory 3301, by
deleting document characteristic vectors of documents
belonging to the portion of clusters selected by the cluster
selection specifier 3010.

Furthermore, the classifier 3004 classifies the 
documents based on the document characteristic vectors 
corrected by the vector corrector 3302. 



The vector memory 3301 and the vector corrector 3302 are
controlled in accordance with commands from the CPU 201 and 301,
and the like, in compliance with commands written in programs
recorded in recording media such as a ROM 202 and 302, a RAM
203 and 303, or a disk device 306, and a hard disk 316.

The document characteristic vectors (row vectors) and
word (word characteristics) vectors (line vectors) are created
in the vector creator 3003, and stored in the vector memory 3301.
This is in order to secure the document characteristic vectors
to be used in subsequent classifications.

The vector corrector 3302 deletes the document
characteristic vectors of all or some of the documents
belonging to the clusters selected by the cluster selection
specifier 3010, so that these documents are excluded from
subsequent classifications. The document characteristic
vectors remaining after the deletion are stored in the vector
memory 3301.

As a result, of the vector data being stored in the vector
memory 3301, the data to be used in subsequent classifications
are the row vectors of those documents that do not belong to
the selected clusters (or to the part thereof specified by the
operator).
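
A minimal Python sketch of the correction performed by the
vector corrector 3302 is shown below. The function name
delete_selected_documents and the representation of each
document as an (id, vector) pair are assumptions made for this
illustration.

```python
def delete_selected_documents(doc_vectors, labels, selected_clusters):
    """Return the (doc_id, vector) pairs kept for the next classification."""
    selected = set(selected_clusters)
    # Keep only documents whose cluster was not selected by the operator.
    return [
        (doc_id, vector)
        for doc_id, (vector, label) in enumerate(zip(doc_vectors, labels))
        if label not in selected
    ]
```

Keeping the original document ID alongside each vector allows
later classification results to be traced back to the input
documents after rows have been removed.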

Next, a processing sequence of the document
classification device according to the third embodiment will
be explained. Fig. 34 is a flowchart showing a processing
sequence of the document classification device according to the
third embodiment. In the flowchart of Fig. 34, firstly, the
document to be classified is input (Step S3401).



Next, the language of the input document is analyzed (Step
S3402), a document characteristic vector is created based on
the result of the analysis, that is, based on the extracted words
(Step S3403), and the created document characteristic vectors
are stored (Step S3404).

Thereafter, the process waits for a classification
parameter to be specified, and when a classification parameter
has been specified (YES in Step S3405), the document is
classified in compliance with the specified classification
parameter (Step S3406), and the result, that is, information
relating to the clusters, is stored (Step S3407).

Next, the characteristics of the classified clusters are
calculated (Step S3408), and the calculated results are
displayed (Step S3409). It is determined whether any of the
displayed clusters has been selected (Step S3410), and if not
(NO in the Step S3410), the processing shifts to the Step S3405
and waits once more for a classification parameter to be
specified (Step S3405).

On the other hand, when it is determined in the Step S3410
that a cluster has been selected (YES in the Step S3410), a
classification category for the selected cluster is created and
stored (Step S3411). At this time, information relating to
clusters input by the operator can also be stored. Thereafter,
it is determined whether a repeat of the processing has been
specified (Step S3412).



In the Step S3412, when a repeat of the processing has
been specified (YES in Step S3412), all or some of the documents
belonging to the selected clusters are deleted by document
characteristic vector correction (Step S3413). Thereafter,
the processing shifts to the Step S3405, and all the processes
from the Steps S3405 to S3413 are repeated.

On the other hand, in the Step S3412, when a repeat of
the processing has not been specified (NO in the Step S3412),
the processing series ends.

As described above, according to the document
classification device of the third embodiment, the vector
corrector 3302 removes the effects of clusters which are
already known, thereby enabling a new cluster to be created.

In the third embodiment described above, a vector memory
and a vector corrector are added to the constitution, but a
document expression space corrector may be added instead of the
vector corrector, as in a fourth embodiment described below.

Since the information processing system comprising the
document classification device according to the fourth
embodiment of the present invention is the same as the first
embodiment shown in Fig. 1, further explanation will be omitted.
Furthermore, since the hardware constitutions of the server 101
and the clients 102 are the same as the first embodiment shown
in Figs. 2 and 3, in order to avoid repetition, their explanation
will be omitted.



Next, the functional constitution of a document 
classification device according to the fourth embodiment will 
be explained. Fig. 35 is a block diagram showing a functional 
constitution of the document classification device according 
to the fourth embodiment. In Fig. 35, like members to those 
in Fig. 30 of the second embodiment are represented by like 
reference symbols, and explanation thereof is omitted. 

In the block diagram of Fig. 35, the document 
classification device comprises an input section 3001, a 
language analyzer 3002, a vector creator 3003, a classifier 3004, 
a classification parameter specifier 3005, a classification 
result memory 3006, a cluster characteristics display 3007, a 
cluster characteristics calculator 3008, a classification 
category memory 3009, a cluster selection specifier 3010, a 
classification category viewing operator 3011, a vector memory 
3501, and a document expression space corrector 3502. 

The vector memory 3501 stores document characteristic
vectors created by the vector creator 3003. Furthermore, the
document expression space corrector 3502 corrects the document
expression space, used when determining the degree of
similarity between document characteristic vectors stored in
the vector memory 3501, based on an
amount of characteristics calculated from the portion of
clusters selected by the cluster selection specifier 3010.

Furthermore, the classifier 3004 classifies the
documents using the document expression space corrected by the
document expression space corrector 3502, based on the degree
of similarity between the document characteristic vectors
created by the vector creator 3003.

The vector memory 3501 and the document expression space
corrector 3502 are controlled in accordance with commands from
the CPU 201 and 301, and the like, in compliance with commands
written in programs recorded in recording media such as a ROM
202 and 302, a RAM 203 and 303, or a disk device 306, and a hard
disk 316.

Next, the contents of the document expression space
corrector 3502 will be explained. In the vector corrector 3302
in the third embodiment, document characteristic vectors were
deleted to eliminate the effects of clusters that were already
known, but the multidimensional space in which the document
characteristic vectors are expressed was not altered.

Therefore, when format characteristics of clusters
selected by the operator in the previous classification are to
be eliminated from the next classification, the space in which
the document characteristic vectors are expressed must itself
be altered.

The document expression space corrector 3502 is provided
for this purpose, and corrects the document expression space.
Here, an example will be explained in which the characteristic
dimensions of the document expression space are altered by
deleting the characteristic dimensions having a high degree of
similarity with the center of a cluster selected by the operator.

Since the center of a cluster selected by the operator
can be expressed as a vector, the degree of similarity between
this cluster center vector and the characteristic dimensions
of the document expression space stored in the vector memory
3501 is calculated, so as to identify the characteristic
dimensions with a high degree of similarity.

The cosine, the inner product, the Euclidean distance, the
Mahalanobis distance, or the like, is used to measure the degree
of similarity. Furthermore, characteristic dimensions with a
high degree of similarity can be identified by threshold value
processing, in which characteristic dimensions with a degree
of similarity exceeding a certain threshold are
deleted, or fixed-number processing, in which a fixed number
of characteristic dimensions with the highest degree of
similarity are deleted. Alternatively, discriminant analysis
or the like can be used.
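
The threshold value processing and fixed-number processing can
be illustrated by the following Python sketch. It assumes that
each characteristic dimension is represented by the unit vector
along its axis, so that the cosine between the cluster center
and a dimension reduces to the corresponding component of the
center divided by the length of the center. This
interpretation and the function names are assumptions of the
sketch, not statements of the embodiment.

```python
import math


def dimension_similarities(center):
    """Cosine between the cluster center and each axis of the space."""
    norm = math.sqrt(sum(c * c for c in center))
    return [c / norm if norm else 0.0 for c in center]


def dimensions_over_threshold(center, threshold):
    """Threshold value processing: indices whose similarity exceeds the threshold."""
    sims = dimension_similarities(center)
    return [d for d, s in enumerate(sims) if s > threshold]


def top_dimensions(center, count):
    """Fixed-number processing: indices of the `count` most similar dimensions."""
    sims = dimension_similarities(center)
    best = sorted(range(len(sims)), key=lambda d: -sims[d])[:count]
    return sorted(best)
```

Threshold processing removes a variable number of dimensions
depending on the cluster, while fixed-number processing bounds
how much of the expression space is removed in each round.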

The document expression space corrector 3502 deletes the
characteristic dimensions after calculating those which are to
be deleted. Deletion is carried out by deleting the line
vectors of the identified characteristic dimensions from the
matrix of "characteristic dimensions (words)" x "documents"
stored in the vector memory 3501. The document vectors
corrected by the document expression space corrector 3502 are
stored in the vector memory 3501 to be used in subsequent
classifications.
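
As a sketch only, deleting identified characteristic
dimensions from the stored matrix might look as follows in
Python. The layout (one list per word, one entry per
document) and the name delete_dimensions are assumptions made
for this illustration.

```python
def delete_dimensions(vocabulary, matrix, dims_to_delete):
    """Drop the word vectors at the given indices; inputs are left unchanged."""
    drop = set(dims_to_delete)
    kept = [d for d in range(len(vocabulary)) if d not in drop]
    new_vocabulary = [vocabulary[d] for d in kept]
    # Copy the surviving word vectors so the stored original is not modified.
    new_matrix = [list(matrix[d]) for d in kept]
    return new_vocabulary, new_matrix
```

All documents remain in the corrected matrix; only the words
that characterized the selected cluster no longer contribute to
the degree of similarity.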

Next, a processing sequence of the document
classification device according to the fourth embodiment will
be explained. Fig. 36 is a flowchart showing a processing
sequence of the document classification device according to the
fourth embodiment. In the flowchart of Fig. 36, firstly, the
document to be classified is input (Step S3601).

Next, the language of the input document is analyzed (Step
S3602), a document characteristic vector is created based on
the result of the analysis, that is, based on the extracted words
(Step S3603), and the created document characteristic vectors
are stored (Step S3604).

Thereafter, the process waits for a classification
parameter to be specified, and when a classification parameter
has been specified (YES in Step S3605), the document is
classified in compliance with the specified classification
parameter (Step S3606), and the result, that is, information
relating to the clusters, is stored (Step S3607).

Next, the characteristics of the classified clusters are
calculated (Step S3608), and the calculated results are
displayed (Step S3609). It is determined whether any of the
displayed clusters has been selected (Step S3610), and if not
(NO in the Step S3610), the processing shifts to the Step S3605
and waits once more for a classification parameter to be
specified (Step S3605).



On the other hand, when it is determined in the Step S3610
that a cluster has been selected (YES in the Step S3610), a
classification category for the selected cluster is created and
stored (Step S3611). At this time, information relating to
clusters input by the operator can also be stored. Thereafter,
it is determined whether a repeat of the processing has been
specified (Step S3612).

In the Step S3612, when a repeat of the processing has
been specified (YES in Step S3612), the document expression
space is corrected by deleting the line vectors of the
identified characteristic dimensions from the matrix
"characteristic dimensions (words)" x "documents" (Step S3613).
Thereafter, the processing shifts to the Step S3605, and all
the processes from the Steps S3605 to S3613 are repeated.

On the other hand, in the Step S3612, when a repeat of
the processing has not been specified (NO in the Step S3612),
the processing series ends.

As described above, according to the document
classification device of the fourth embodiment,
format characteristics of a cluster selected by the operator
in a previous classification can be deleted from subsequent
classifications by the document expression space corrector
3502, enabling a new cluster to be created with those
characteristics removed.

In the third and fourth embodiments described above,
either a vector corrector or a document expression space
corrector is added to the constitution, but both the vector
corrector and the document expression space corrector may be
added, as in a fifth embodiment described below.

Since the information processing system comprising the
document classification device according to the fifth
embodiment of the present invention is the same as the first
embodiment shown in Fig. 1, further explanation will be omitted.
Furthermore, since the hardware constitutions of the server 101
and the clients 102 are the same as the first embodiment shown
in Figs. 2 and 3, in order to avoid repetition, their explanation
will be omitted.

Next, the functional constitution of a document
classification device according to the fifth embodiment will
be explained. Fig. 37 is a block diagram showing a functional
constitution of the document classification device according
to the fifth embodiment. In Fig. 37, like members to those in
Fig. 30 of the second embodiment are represented by like
reference symbols, and explanation thereof is omitted.

In the block diagram of Fig. 37, the document
classification device comprises an input section 3001, a
language analyzer 3002, a vector creator 3003, a classifier 3004,
a classification parameter specifier 3005, a classification
result memory 3006, a cluster characteristics display 3007, a
cluster characteristics calculator 3008, a classification
category memory 3009, a cluster selection specifier 3010, a
classification category viewing operator 3011, a vector memory
3701, a vector corrector 3702, and a document expression space
corrector 3703.

The vector memory 3701 stores document characteristic
vectors created by the vector creator 3003. Furthermore, the
vector corrector 3702 corrects the document characteristic
vectors stored in the vector memory 3701, by deleting document
characteristic vectors of documents belonging to the portion of
clusters created by the classifier 3004.

Furthermore, the document expression space corrector
3703 corrects the document expression space, used when
determining the degree of similarity between document
characteristic vectors stored in the vector memory 3701,
based on the characteristics of clusters
selected by the cluster selection specifier 3010.

Furthermore, the classifier 3004 classifies the
documents based on the degree of similarity between document
characteristic vectors corrected by the vector corrector 3702,
using the document expression space corrected by the document
expression space corrector 3703.

The vector memory 3701, the vector corrector 3702, and
the document expression space corrector 3703 are controlled in
accordance with commands from the CPU 201 and 301, and the like,
in compliance with commands written in programs recorded in
recording media such as a ROM 202 and 302, a RAM 203 and 303,
or a disk device 306, and a hard disk 316.

Next, the contents of the vector corrector 3702 and the
document expression space corrector 3703 will be explained. In
the fourth embodiment, documents belonging to a selected
cluster are used in subsequent classifications.

In the fifth embodiment, since the vector corrector 3702
and the document expression space corrector 3703 are both
provided, documents belonging to selected clusters are deleted
and are therefore not classified in subsequent classifications.

In the fourth embodiment, the aspect of topic extraction
is emphasized, and it is assumed that a given document can be
classified under multiple topics. For example, in an
investigation into networking, the following answer is given:
"The end user enquires about how to install the software, and
so cannot work as a system manager." This can be classified
under the topic of "difficulties relating to understanding the
software operation", but can also be classified under the topic
of "busy nature of system manager work".

The fourth embodiment addresses the need to be able to
extract both the cluster "difficulties relating to
understanding the software operation" and the cluster "busy
nature of system manager work".



Conversely, since the operator already knows topics which
have been extracted once, there will be cases when he or she
desires a different result from the next classification. The
fifth embodiment addresses this requirement by providing the
vector corrector 3702, thereby ensuring that all or some of the
documents belonging to clusters selected in the nth
classification are deleted from subsequent classifications.

Documents belonging to clusters which have been specified
for selection by the cluster selection specifier 3010 are stored
in row vector format in the vector memory 3701. Therefore,
document clusters for subsequent classification are created by
deleting these row vectors using the vector corrector 3702.

Moreover, as in the fourth embodiment, in accordance with
the selected clusters, the document expression space corrector
3703 deletes the characteristic dimension from the matrix
stored in the vector memory 3701.

Next, a processing sequence of the document
classification device according to the fifth embodiment will
be explained. Fig. 38 is a flowchart showing a processing
sequence of the document classification device according to the
fifth embodiment. In the flowchart of Fig. 38, firstly, the
document to be classified is input (Step S3801).

Next, the language of the input document is analyzed (Step
S3802), a document characteristic vector is created based on
the result of the analysis, that is, based on the extracted words
(Step S3803), and the created document characteristic vector
is stored (Step S3804).

Thereafter, the process waits for a classification
parameter to be specified, and when a classification parameter
has been specified (YES in Step S3805), the document is
classified in compliance with the specified classification
parameter (Step S3806), and the result, that is, information
relating to the clusters, is stored (Step S3807).

Next, the characteristics of the classified clusters are
calculated (Step S3808), and the calculated results are
displayed (Step S3809). It is determined whether any of the
displayed clusters has been selected (Step S3810), and if not
(NO in the Step S3810), the processing shifts to the Step S3805
and waits once more for a classification parameter to be
specified (Step S3805).

On the other hand, when it is determined in the Step S3810
that a cluster has been selected (YES in the Step S3810), a
classification category for the selected cluster is created and
stored (Step S3811). At this time, information relating to
clusters input by the operator can also be stored. Thereafter,
it is determined whether a repeat of the processing has been
specified (Step S3812).

In the Step S3812, when a repeat of the processing has
been specified (YES in Step S3812), all or some of the documents
belonging to the selected clusters are deleted by document
characteristic vector correction (Step S3813).

Following the Step S3813, the document expression space
is corrected by deleting the row vectors of the identified
characteristic dimensions from the matrix "characteristic
dimensions (words)" x "document" (Step S3814). Thereafter,
the processing shifts to the Step S3805, and all the processes
from the Steps S3805 to S3814 are repeated.

On the other hand, in the Step S3812, when a repeat of
the processing has not been specified (NO in the Step S3812),
the processing series ends.
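
The two correction steps of Fig. 38 (Steps S3813 and S3814) can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function name correct_vectors, the nested-list layout of the "characteristic dimensions (words)" x "document" matrix, and the sample words and values are all assumptions made for the example.

```python
# Hypothetical sketch of Steps S3813 and S3814; names and data are assumptions.

def correct_vectors(matrix, words, doc_ids, selected_docs, selected_words):
    """Return a corrected copy of a words x documents matrix.

    matrix[i][j] is the weight of words[i] in the document doc_ids[j].
    """
    # Step S3813: delete the documents belonging to the selected clusters.
    keep_cols = [j for j, d in enumerate(doc_ids) if d not in selected_docs]
    # Step S3814: delete the row vectors of the identified characteristic dimensions.
    keep_rows = [i for i, w in enumerate(words) if w not in selected_words]
    new_matrix = [[matrix[i][j] for j in keep_cols] for i in keep_rows]
    new_words = [words[i] for i in keep_rows]
    new_docs = [doc_ids[j] for j in keep_cols]
    return new_matrix, new_words, new_docs


words = ["price", "design", "battery"]
doc_ids = [1, 2, 3, 4]
matrix = [
    [2, 0, 1, 0],  # "price"
    [0, 3, 0, 1],  # "design"
    [1, 0, 0, 2],  # "battery"
]
# Assume the operator selected a cluster containing documents 1 and 3,
# characterized by the word "price".
new_matrix, new_words, new_docs = correct_vectors(
    matrix, words, doc_ids, selected_docs={1, 3}, selected_words={"price"}
)
```

The corrected matrix would then be passed back to the classification of Step S3806 on the next repetition.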

As described above, according to the document
classification device of the fifth embodiment, the vector
corrector 3702 eliminates the effects of clusters which are
already known, and in addition, the document expression space
corrector 3703 eliminates the format characteristics of a
cluster selected by the operator in a previous classification
from subsequent classifications, thereby enabling a new cluster
to be created with those effects and characteristics removed.

In the second and fourth embodiments described above,
when classification was repeatedly carried out, no
consideration was given to information relating to how many
times a document was selected, but when the constitution
comprises a selection information appender, as in a sixth
embodiment described below, selection information can be
displayed together with cluster characteristics.



Since the information processing system comprising the
document classification device according to the sixth
embodiment of the present invention is the same as that of the first
embodiment shown in Fig. 1, further explanation will be omitted.
Furthermore, since the hardware constitutions of the server 101
and the clients 102 are the same as those of the first embodiment shown
in Figs. 2 and 3, in order to avoid repetition, their explanation
will be omitted.

Next, the functional constitution of a document 
classification device according to the sixth embodiment will 
be explained. Fig. 39 is a block diagram showing a functional 
constitution of the document classification device according 
to the sixth embodiment. In Fig. 39, like members to those in 
Fig. 35 of the fourth embodiment are represented by like 
reference symbols, and explanation thereof is omitted. 

In the block diagram of Fig. 39, the document 
classification device comprises an input section 3001, a 
language analyzer 3002, a vector creator 3003, a classifier 3004,
a classification parameter specifier 3005, a classification 
result memory 3006, a cluster characteristics display 3007, a 
cluster characteristics calculator 3008, a classification 
category memory 3009, a cluster selection specifier 3010, a 
classification category viewing operator 3011, a vector memory 
3501, a document expression space corrector 3502, and a 
selection information appender 3901. 



When all or some of the documents belonging to a cluster (a
subset of the documents) created by the classifier 3004 have been selected,
the selection information appender 3901 appends selection
information showing that the documents have been selected.
Furthermore, the cluster characteristics display 3007 displays
the cluster characteristics together with the selection information
appended by the selection information appender 3901.

The selection information appender 3901 is controlled
in accordance with commands from the CPU 201 and 301, and the
like, in compliance with commands written in programs recorded
in recording media such as a ROM 202 and 302, a RAM 203 and 303,
or a disk device 306, and a hard disk 316.

Next, the detailed contents of the selection information
appender 3901 will be explained. In a questionnaire,
experience has taught that unique and highly opinionated
answers are extremely important. This is because many answers
could not have been anticipated by the person who planned the
questionnaire.

Accordingly, in a case where documents belonging to a 
cluster selected by the operator are used in subsequent 
classifications, it is possible to improve the ability to 
identify documents used on multiple occasions, and also the 
ability to identify documents which have not been selected at 
all, by showing how many times the documents have been selected 
when the cluster characteristics display 3007 displays the
individual documents.

Fig. 40 is a diagram explaining a table 4000 provided in 
the classification result memory 3006 of the document 
classification device according to the sixth embodiment. In 
Fig. 40, table contents are listed for each document ID, and 
the table 4000 shows in which cycle each document was selected 
by the operator during classification. That is, when a document
has been selected, selection information of "1" is entered, and 
when a document has not been selected, selection information 
of "0" is entered. 

For example, when classification has been carried out four times,
the table 4000 shows that document ID "1" was selected by the
operator in the first and second classifications, but was not
selected in the third and fourth classifications. On the other
hand, document ID "2" has not yet been selected even once, indicating
that it is an opinion unknown to the operator.
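
How the table 4000 might be read can be illustrated with a short sketch. The dictionary layout below (document ID mapped to a list of 0/1 selection information, one entry per classification cycle) and the helper names are assumptions for illustration; the text does not specify how the classification result memory 3006 stores the table.

```python
# Hypothetical in-memory form of table 4000:
# document ID -> selection information ("1" or "0") for each classification cycle.
table_4000 = {
    1: [1, 1, 0, 0],  # selected in the first and second classifications only
    2: [0, 0, 0, 0],  # not selected in any classification
}


def selection_count(table, doc_id):
    """Number of classification cycles in which the document was selected."""
    return sum(table[doc_id])


def never_selected(table):
    """IDs of documents that have not been selected in any cycle so far."""
    return [doc_id for doc_id, flags in table.items() if not any(flags)]
```

A display routine could, for instance, use selection_count to choose the color or background density of each document, and never_selected to highlight opinions that the operator has not yet examined.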

Based on such information, when the cluster 
characteristics display 3007 displays the documents to the 
operator, the display may for example be altered in accordance 
with the number of times the documents have been selected. For 
example, visual characteristics such as the color of the letters, 
the density of the background, and the color intensity may 
conceivably be altered. 

Furthermore, the number of selections can be directly 
displayed by numerical symbols, graphs, or the like. In any
case, as long as it is possible to visually identify selected
documents and unselected documents, the constitution is not 
limited to that described above. 

Furthermore, the selection information may be viewed
using the classification category viewing operator 3011.

Next, the processing performed by the selection
information appender 3901 will be explained. Fig. 41 is a
flowchart showing a processing sequence of the selection
information appender 3901 of the document classification device
according to the sixth embodiment. In the flowchart of Fig.
41, firstly, classification is carried out (Step S4101), and
then, the first document is extracted (Step S4102).

It is determined whether the extracted document has been
selected for classification in the Step S4101 (Step S4103).
Here, when the document has been selected (YES in the Step S4103),
data "1" is stored as the selection information (Step S4104).
On the other hand, when the document has not been selected (NO
in the Step S4103), data "0" is stored as the selection
information (Step S4105).

Next, it is determined whether or not the processing of
all the documents has ended (Step S4106). Here, when not all of the
documents have been processed (NO in the Step S4106), the
next document is extracted (Step S4107), the processing shifts
to the Step S4103, and the Steps S4103 to S4107 are repeated.

On the other hand, in the Step S4106, when all the
documents have been processed (YES in the Step S4106), the
processing shifts to the Step S4101, and classification is
performed again (Step S4101). In this way, the number of times
that the processing between the Steps S4101 to S4107 is repeated
is equal to the number of classifications.
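
The loop of Fig. 41 can be sketched as below. This is a minimal sketch under stated assumptions: the function name append_selection_info, the dictionary used to hold the selection information, and the three sample cycles are hypothetical, and the classification of Step S4101 is represented only by the set of document IDs that the operator selected in that cycle.

```python
def append_selection_info(table, doc_ids, selected_ids):
    """Append one cycle of selection information for every document."""
    for doc_id in doc_ids:                          # Steps S4102 and S4107
        flag = 1 if doc_id in selected_ids else 0   # Steps S4103 to S4105
        table.setdefault(doc_id, []).append(flag)
    return table                                    # YES in Step S4106


table = {}
# Three hypothetical classification cycles (Step S4101), each represented by
# the set of document IDs selected in that cycle.
for selected in [{1}, {1, 3}, set()]:
    append_selection_info(table, [1, 2, 3], selected)
```

After the loop, each list in the table has one entry per classification, which matches the statement that the processing is repeated as many times as there are classifications.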

As explained above, according to the sixth embodiment,
the selection information appender 3901 appends selection
information, which is displayed by the cluster characteristics
display 3007, and consequently, it is possible to improve the
ability to identify documents used on multiple occasions, and
also the ability to identify documents which have not been
selected at all.

The document classification method described in the
second to fifth embodiments is realized by running a
predetermined program on a computer, such as a personal computer
or a work station. The program is recorded on a
computer-readable recording medium such as a hard disk, a floppy disk,
a CD-ROM, an MO, or a DVD, and is executed by reading out the
program from the recording medium using the computer.
Furthermore, the program can be distributed via the recording
medium, or by using a network such as the Internet as a
transmission medium.

Next, an information classification device according to
the seventh to sixteenth embodiments will be explained. In the
present embodiments of the present invention, one or more
collections of sentences written in a natural language that
are to be classified will be termed a document. By way of
a more specific example, patent laid-open publications
classified by IPC classification, or newspaper articles
classified into specific fields such as politics, economics,
culture, science and technology, and the like, are documents.
When claims and specific sentences are extracted therefrom,
these are regarded either as sentences under the classification
of "claims", or, in the case of specific sentences which can
be classified according to intended usage, these are regarded
as documents. There follows a detailed description of the
seventh to sixteenth embodiments of the present invention based
on the drawings.

Fig. 42 is a block diagram showing a constitution of a
document classification device according to the seventh
embodiment of the present invention. As shown in Fig. 42, the
document classification device of the seventh embodiment
comprises a document input section (document input means) 5001
for inputting document data groups, a document divider
(document dividing means) 5002 for dividing document data into
one or multiple divided document data based on a predetermined
reference, a document-divided document map creator
(document-divided document map creation means) 5003 for
creating a map showing the correspondence between the document
data and the divided document data, a divided document
classifier (divided document classifying means) 5004 for
classifying the divided document data, that is, the divided
document, a divided document classification result creator
(divided document classification result creation means) 5005
for creating divided document classification result
information, a document classification result creator
(document classification result creation means) 5006 for
creating classification result information of the above
document data using the document-divided document map and the
divided document classification result information, etc.

The document divider 5002, the document-divided document 
map creator 5003, the divided document classifier 5004, the 
divided document classification result creator 5005, and the 
document classification result creator 5006 have a shared or 
independent memory for storing programs and a CPU, which 
operates in compliance with the programs. 

Next, the document classification device and the document 
classification method of the seventh embodiment will be 
explained in detail in accordance with Fig. 42 and the like. 
Firstly, the document input section 5001 inputs a group of 
documents. The document input section 5001 comprises a 
keyboard, an OCR device, a detachable recording medium, or 
network communications means, and the documents are input via 
any one of these. 

Then, the document divider 5002 extracts the document data,
divides them based on a predetermined reference, and creates
one or multiple divided document data from one document data.
The document data is divided using a method specified by the
user, such as using information relating to the structure of
the documents, or information relating to the constituents of
the documents. The particular method used does not matter here.

Fig. 43 shows an example of creating multiple divided
document data from document data using the document
classification device and document classification method of the
present invention. In this example, a document 1 comprises
multiple news topics, and each one-minute topic forms one
document unit. As shown in Fig. 43, the news topics are
separated by two line-break codes. The document 1, comprising
one document, is divided using this stipulation to create seven
divided document data of divided documents 1-1 to 1-7, each
comprising a separate topic. It is also possible to include
the document 1 in its state prior to division in the data, but
this is not done here.
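
The division stipulation of this example (topics separated by two line-break codes) can be sketched as follows. The function name and the sample text are assumptions made for illustration, and the line-break code is assumed to be "\n".

```python
def divide_on_blank_lines(document):
    """Divide document data wherever two consecutive line-break codes occur."""
    parts = document.split("\n\n")
    # Discard empty fragments produced by leading or trailing line breaks.
    return [part.strip() for part in parts if part.strip()]


# Hypothetical document data containing three topics.
document_1 = "Topic A, line 1\nTopic A, line 2\n\nTopic B\n\nTopic C\n"
divided = divide_on_blank_lines(document_1)
```

A single line break inside a topic does not cause a division, so a topic spanning several lines remains one divided document.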

When the document has been divided, the document-divided
document map creator 5003 creates a map showing the document
data prior to division in correspondence with the divided
document data created from the document data. For example, the
document-divided document map creator 5003 creates a map
comprising identifiers uniquely representing individual
document data, and identifiers uniquely representing
individual divided document data, or a map comprising
identifiers uniquely representing divided document data for
each document data. The particular method for arranging the document data
and divided document data in mutual correspondence does not
matter here.

Fig. 44 shows an example of creating a document-divided
document map. In Fig. 44, the documents 1 to 3 represent 
document data, and the divided documents 1 to 12 represent 
divided document data. As shown in the diagram, identification 
numbers (identifiers) for uniquely identifying the document 
data and the divided document data are appended. Then, as shown 
in the bottom left portion of Fig. 44, the identification 
numbers of the document data and the identification numbers of 
the divided document data are listed in mutual correspondence 
in table format. When multiple divided document data can be 
regarded as identical with regard to the reference used for the 
document classification, identical identification numbers may 
be appended thereto. 
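
A document-divided document map in the table format described above can be sketched as a list of identifier pairs. The function names and the sample input are hypothetical; the actual correspondence shown in Fig. 44 is not reproduced here.

```python
def build_document_map(divided_by_document):
    """Build (document ID, divided document ID) pairs in table format.

    divided_by_document maps each document ID to the list of divided
    documents created from it; divided documents are numbered serially.
    """
    rows = []
    next_divided_id = 1
    for document_id in sorted(divided_by_document):
        for _ in divided_by_document[document_id]:
            rows.append((document_id, next_divided_id))
            next_divided_id += 1
    return rows


def document_for(rows, divided_id):
    """Look up the document from which a divided document was created."""
    for document_id, d_id in rows:
        if d_id == divided_id:
            return document_id
    return None


# Hypothetical input: three documents divided into six divided documents.
rows = build_document_map({1: ["a", "b"], 2: ["c"], 3: ["d", "e", "f"]})
```

Storing the map as pairs keeps both directions of lookup simple: from a document to its divided documents, and from a divided document back to its source.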

Thereafter, the divided document classifier 5004
classifies the divided documents. The divided documents can
be classified by, for example, language-analyzing the
individual divided documents, counting the incidence of words
contained therein, determining characteristics vectors
quantitatively showing the characteristics of the documents
based on the result of the language analysis, and then using
a method such as square contingency, discriminant analysis,
or cluster analysis.
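
The creation of characteristics vectors from word incidence can be sketched as follows. This sketch makes a strong simplifying assumption: language analysis is replaced by lower-casing and splitting on whitespace, which would not be adequate for real morphological analysis. The function name and sample sentences are also hypothetical.

```python
from collections import Counter


def characteristics_vectors(divided_docs):
    """Return (vocabulary, vectors) where each vector counts word incidence."""
    # Assumed stand-in for language analysis: whitespace tokenization.
    counts = [Counter(text.lower().split()) for text in divided_docs]
    vocabulary = sorted(set().union(*counts))
    # One component per word type occurring anywhere in the collection.
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


vocabulary, vectors = characteristics_vectors(["rain rain sun", "sun wind"])
```

The resulting vectors could then be handed to whichever classification method is used, such as cluster analysis.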

Next, the divided document classification result creator
5005 creates divided document classification result
information based on the result of the divided document
classification (see Fig. 45). Here, the divided document
classification result information comprises, for example, (a)
information relating to categories to which the divided data
belong (e.g. information of the items "classification category"
and "representative value and distance of categories to which
the documents belong" in the table of "Results of classifying
divided document data into three categories" shown in Fig. 45),
(b) information relating to individually created categories
(e.g. information of the items "representative value" and
"number of data belonging to category (number of divided
document)" in the table of "Information Relating to
Classification Categories" shown in Fig. 45), (c) information
between created categories (e.g. information in the table of
"Distance between Classification Categories" in Table 4), and
(d) other such information. The user can also use the various information
mentioned above as basic data for analyzing the classification
result.

Fig. 45 shows an example of creating a classification
result in a case where twelve divided document data are
classified into three categories using their quantitative
characteristics vectors. The quantitative three-dimensional
vectors of the divided document data (the number of components
of the vector is the number of all the types of words originating
in the classified document cluster, but here, the vectors are
linearly converted to three-dimensional vectors in which
several words have been deleted) can be classified into three
categories by utilizing a cluster analysis method such as, for
example, Ward's method.

That is, each of the divided document data belongs to one
of the three categories shown in the diagram. The
representative value of each category to which the divided
document data belong is an average value of the characteristics
vector of the divided document data which belong to the category
(the center of the divided document data which belong to the
category).

Furthermore, the distance (corresponding to the degree
of similarity) to the representative value of the category to
which the data belongs can be determined (for example, in the
case of the divided document 3 in Fig. 45) using the value of
divided document 3 in the divided document data characteristics
vector item, and the value of the item of the representative
value (center of the divided document category) of the category
2, which is the classification category for the divided document
3, in the following equation.

$$\left( (3.00 - 2.66)^2 + (2.00 - 2.00)^2 + (4.00 - 3.66)^2 \right)^{1/2} = 0.48$$



Hence, the smaller the distance to the representative 
value of the category to which the divided document belongs, 
the higher the degree of similarity with the average divided 
document belonging to that category. 
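
The representative value and the distance can be checked with a short sketch. The function names are assumptions; the two vectors passed to the distance function are the values of divided document 3 and of the representative value of category 2 quoted in the text, while the input to the centroid function is made-up data.

```python
import math


def centroid(vectors):
    """Representative value of a category: component-wise average."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between a vector and a representative value."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Values for divided document 3 and the center of category 2 from the text.
d = distance([3.00, 2.00, 4.00], [2.66, 2.00, 3.66])

# Hypothetical two-dimensional data, used only to exercise centroid().
center = centroid([[1.0, 2.0], [3.0, 4.0]])
```

Rounded to two decimal places, d agrees with the value 0.48 given for divided document 3.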

In addition to the statistics shown in Fig. 45, various 
statistics can be created, such as dispersion within or between 
categories, the range of the degree of similarity in each 
category, etc. 

Then, the document classification result creator 5006
uses the document-divided document map and the divided document
classification result information to create classification
result information of the document data, such as that shown in
Fig. 46, for example. As shown in the example of Fig. 46, for
each category, classification result information such as the
divided document data belonging to each category, the degree of
similarity thereof (distance to the representative value of the
category to which the data belongs), the pre-division document
data to which the divided document data belongs (document to
which data belongs), the area occupied by the document (the
share of the category occupied by the divided document data),
the relative position of the divided document data in the
document (order), and the degree of similarity ranking of the
divided document data within the category to which it belongs,
are created.

In the above example, the item "document to which data belongs" is
obtained from the document-divided document map, and other
classification result information is obtained from the divided
document classification result information. In addition to
the information shown in Fig. 46, the document classification
result creator 5006 can use various statistics, such as the
dispersion of the data within categories, and the deviation
value of the divided document data within the category to which
it belongs, and the contents of the document data and the divided
document data, and the like, as the classification result
information.
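
How the two sources of information might be combined can be sketched as below. This is an illustrative sketch only: the function name, the row layout, and the sample data are assumptions, and only some of the items of Fig. 46 are computed (the document to which the data belongs, the order within the document, the distance, and the similarity ranking within the category). The "area occupied by the document" item is left out because the text does not define its calculation precisely.

```python
def document_classification_result(doc_map, divided_results):
    """Combine the map and the divided document classification results.

    doc_map: list of (document ID, divided document ID), in document order.
    divided_results: {divided document ID: (category, distance)}.
    """
    parent, order, seen = {}, {}, {}
    for document_id, divided_id in doc_map:
        seen[document_id] = seen.get(document_id, 0) + 1
        parent[divided_id] = document_id        # from the map
        order[divided_id] = seen[document_id]   # position within the document
    rows = [
        {"category": category, "divided": divided_id, "distance": dist,
         "document": parent[divided_id], "order": order[divided_id]}
        for divided_id, (category, dist) in divided_results.items()
    ]
    # A smaller distance means a higher degree of similarity, so rank by it.
    rows.sort(key=lambda r: (r["category"], r["distance"]))
    rank = {}
    for row in rows:
        rank[row["category"]] = rank.get(row["category"], 0) + 1
        row["rank"] = rank[row["category"]]
    return rows


# Hypothetical data: two documents, three divided documents, two categories.
result = document_classification_result(
    [(1, 1), (1, 2), (2, 3)],
    {1: (1, 0.5), 2: (2, 0.1), 3: (1, 0.2)},
)
```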

Furthermore, in the example described above, all the
results are expressed in table format in units of divided
document data, but the classification categories and document
data can also be used as the units of expression. Furthermore, the
classification result information need not only be expressed
in text format, but can also be expressed graphically, making
it more comprehensible to the user.

Thus, according to the present invention, one document 
is divided, the divided document is classified, and the 
relationship between the document prior to division and the 
divided document is displayed to the user. Furthermore, the 
classification result of the divided document is displayed to 
the user. Therefore, when one document contains multiple 
topics and meanings, the document is not classified into 
categories limited to specific topics and meanings, or
classified into categories different from those desired by the
user, making the classification categories more easily 
comprehensible to the user. Furthermore, since the position 
of the divided document in the document prior to division (the 
document to which the divided document belongs) is displayed, 
the user can efficiently read the part of the document cluster 
that he or she wants to read. 

Fig. 47 is a block diagram showing a constitution of the
document classification device according to an eighth
embodiment of the present invention. As shown in Fig. 47, in
addition to the constitution shown in the seventh embodiment
of Fig. 42, the document classification device according to the
eighth embodiment further comprises (a) a document saving section
(document saving means) 5007 for saving document data, (b) a
divided document saving section (divided document saving means)
5008 for saving divided document data, and (c) a
document-divided document map saving section (document-divided document
map saving means) 5009 for saving a document-divided document
map created by the document-divided document map creator 5003.
The saving sections for example comprise shared hard disks,
semiconductor memories, or the like.

With the constitution described above, the document
saving section 5007 of the present embodiment saves information
accompanying the document, such as the contents of the document,
the author of the document, the date of authorship, and the date
of last correction, in an appropriate format. Furthermore,
when the document has a quantitative characteristics vector
comprising elements of the document, in addition to the document
contents, these are also saved in the document saving section
5007. When identifiers uniquely expressing the individual
document data are appended in the document input section 5001,
the document saving section 5007 also saves these identifiers
in an appropriate format.

Furthermore, the divided document saving section 5008 saves the
contents of the divided document data created by the document
divider 5002 in an appropriate format, and in addition, saves
quantitative characteristics vectors. When identifiers
uniquely expressing the individual document data are appended,
the divided document saving section 5008 also saves the
identifiers in an appropriate format.

Furthermore, the document-divided document map saving 
section 5009 saves document-divided document maps created by 
the document-divided document map creator 5003 in an 
appropriate format. 

According to the eighth embodiment, since document data,
divided document data, and document-divided document maps are
saved in this way, for a single document data it is possible
to efficiently determine classification results having
different parameters such as the number of classifications, the
classification method, and the settings used in the
classifications, without recreating the divided document data
and the document-divided document map. Furthermore, by
classifying the document data and saving the data needed to
create the classification result, the user is free to take more
time over the classification, and to re-analyze previously
classified documents within a given period of time.

Fig. 48 is a block diagram showing a constitution of the
document classification device according to a ninth embodiment
of the present invention. As shown in Fig. 48, in addition to
the constitution shown in the eighth embodiment of Fig. 47, the
document classification device of the present embodiment
further comprises a divided document classification result
saving section (divided document classification result saving
means) 5010 for saving the divided document classification
results created by the divided document classification result
creator 5005. The divided document classification result
saving section 5010 comprises, for example, a shared hard disk,
a semiconductor memory, or the like.

Thus, according to the ninth embodiment, since document
data, divided document data, document-divided document maps,
and divided document classification results are saved, in
addition to the effects of the eighth embodiment, it is possible
to express the classification result of a single classification
in various formats, such as textual format, chart format and
graph format. Moreover, since the divided document
classification result information is saved, during
classifications and analysis of classification results, the
user is free to take more time over the operations, and can
re-analyze previously classified documents in a variety of
formats within any given period of time.

In the document classification device and document
classification method according to the tenth embodiment of the
present invention, as shown in Fig. 49, a document 1 comprising
document data prior to division is itself included among the multiple
divided document data created by the document divider 5002. As
a consequence, in the present embodiment, the user is able to
obtain not only a detailed classification structure of document
data, but also a classification structure fusing a schematic
macro classification structure, obtained as a result of
classifying the document data itself prior to division.

In the document classification device and document
classification method according to the eleventh embodiment of
the present invention, the document divider 5002 divides the
document data based on structural information relating to the
document data. Fig. 50 shows an example of classification
object document data described in HTML format.
Prior to division, structural information is extracted from
HTML-format document data such as that shown in Fig. 50, and
divided document data is created from document data by setting
an appropriate division stipulation for the documents using that
structure.

That is, taking the tag "LI" in the document data by way
of example, it is a stipulation for creating divided document
data to "treat text having tag 'LI' as one divided document data".
By applying this stipulation to the document data, the seven
divided documents shown in Fig. 50 are created.
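
The stipulation "treat text having tag 'LI' as one divided document data" can be sketched with the HTML parser of the Python standard library. The class name, the function name, and the sample HTML are assumptions for illustration; nested lists are not handled in this sketch.

```python
from html.parser import HTMLParser


class LiDivider(HTMLParser):
    """Collects the text of each LI element as one divided document."""

    def __init__(self):
        super().__init__()
        self.divided = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":              # the parser reports tag names in lower case
            self._in_li = True
            self.divided.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._in_li:
            self.divided[-1] += data


def divide_by_li(html_text):
    parser = LiDivider()
    parser.feed(html_text)
    parser.close()
    return [text.strip() for text in parser.divided if text.strip()]


# Hypothetical HTML-format document data.
sample = ("<H1>News Topic (98/09/25)</H1>"
          "<UL><LI>First topic</LI><LI>Second topic</LI></UL>")
divided = divide_by_li(sample)
```

Text outside any LI element, such as the heading in the sample, is not placed in any divided document.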

Even when the document does not have a specific structural
format such as HTML, XML, or SGML, as described above, a
stipulation for division can be created from information
relating to the size of the letters, the decoration of the
letters, the color of the letters, the font, and the like,
enabling the document to be divided. Furthermore, when the
document data comprises an image, and is input by an OCR device
or the like, a stipulation for division can be created using
information relating to the original layout of the image, or
the like, enabling a divided document to be created.

It is not necessary to use all the document data for the
divided document data. For example, in the example shown in
Fig. 50, the letter row "News Topic (98/09/25)" is not used in
the divided document.

Thus, in the eleventh embodiment, structural information
is extracted from the document data, and the structural
information is used to set an appropriate stipulation for
division prior to dividing the document. As a result, different
topics are divided appropriately. Consequently, documents can
be classified in such a manner that the detailed classification
structure of the document data is known.

In the twelfth embodiment, the document classification
device and document classification method according to the
seventh to tenth embodiments of the present invention, as shown
in Fig. 51, further comprise (a) a document element analyzer
(document element extraction means) 5011 for extracting
elements such as words contained in the document data, and (b)
an extractor of information accompanying elements (information
accompanying elements extraction means) 5012 for extracting
information accompanying the elements such as the part of speech
accompanying the elements extracted by the document element
analyzer 5011 (Fig. 51 shows an example in which the document
element analyzer 5011 and the extractor of information
accompanying elements 5012 are additionally provided to the
ninth embodiment of Fig. 48). The document divider 5002 divides
the document data using the elements extracted by the document
element analyzer 5011, and the information accompanying the
elements extracted by the extractor of information accompanying
elements 5012.

As shown in Fig. 52, prior to division, the document
element analyzer 5011, comprising language analysis processing
means, extracts from the document data elements such as words,
and the extractor of information accompanying elements 5012
extracts information accompanying the elements such as the
parts of speech, and an appropriate stipulation for division
is set in accordance with the information. The document element
analyzer 5011 and the extractor of information accompanying
elements 5012 do not have to be newly provided, since similar
means in the divided document classifier 5004 can be used
instead.

In this embodiment, as for example shown in Fig. 52, the 
document data comprises a group of multiple news topics having 
no specific structural information. In this example, the 
topics are listed after letter rows comprising: word "topic"
+ "number" + "return symbol". The above structure is identified
from the extraction results of the document element analyzer 
5011 and the extractor of information accompanying elements 
5012, and after considering the ends of the sentences, the 
following division stipulation is created: "With the letter row 
"topic + number + return symbol" as the header, deem a letter 
row comprising the above letter row, and a letter row surrounded 
by a document return symbol, to be one divided document data". 

More specifically, firstly, only the parts of speech and
return symbols are extracted from the extracted words and
information about parts of speech and the like. Then, letter
rows "topic + number + return symbol" and document end symbols
are detected, and their positions in the document are stored.
Then, a division stipulation is applied to the document data,
creating divided document data such as that shown in Fig. 52.

It is not necessary to use all the document data for the
divided document data. For example, in the example shown in
Fig. 52, the letter row "News Topic (98/09/25)" is not used in
the divided document. Furthermore, in the above example,
elements and information accompanying the elements are extracted
from the document data in order to set a stipulation for division,
but it is acceptable to extract only the elements, and to set
a stipulation for division based only on the element
information.
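
The header-based stipulation of the Fig. 52 example can be approximated with a regular expression. This is a simplified sketch, not the method of the embodiment: it matches the letter row "topic + number + return symbol" directly in the text rather than using parts of speech from language analysis, and the pattern, the function name, and the sample text are all assumptions.

```python
import re

# Assumed header form: the word "topic", a number, then the end of the line.
HEADER = re.compile(r"^topic[ \t]*\d+[ \t]*$", re.IGNORECASE | re.MULTILINE)


def divide_by_topic_header(document):
    """Treat each header and the text up to the next header as one unit."""
    starts = [match.start() for match in HEADER.finditer(document)]
    # Each unit ends where the next header starts, or at the document end.
    ends = starts[1:] + [len(document)]
    return [document[s:e].strip() for s, e in zip(starts, ends)]


# Hypothetical document data without structural markup.
sample = ("News Topic (98/09/25)\n"
          "Topic 1\nRain is expected.\n"
          "Topic 2\nStocks rose.\n")
divided = divide_by_topic_header(sample)
```

Text that precedes the first header, such as the title row in the sample, is not included in any divided document.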

Thus, according to the twelfth embodiment, element
information and the like is extracted from the document data,
and the extracted element information and the like is used to
set an appropriate stipulation for division. Consequently, as
in the eleventh embodiment, the document can be classified
in such a manner that the detailed classification structure of
the document data is known.

According to the thirteenth embodiment, in the document 
classification device and document classification method 
according to the seventh to the tenth embodiments, the document 
divider 5002 divides data in accordance with a specification 
range specified by the user. When the user specifies various 
divided document ranges for document data such as that shown 
in Fig. 53, the document divider 5002 divides the document in 
compliance with the specifications. 

In the present embodiment, when classifying a document,
the document divider 5002 firstly displays on the screen left
and right specification points, and a region specification
object comprising region specification lines, as the
initialized state in the upper part of the document. In this
state, by using a pointing device such as a mouse to drag the
left or right specification point and move it up and down, the
user can select regions of the divided document.

When making a specification, the document divider 5002 
shows that a region is being selected by changing the color of 
the specification point from dark to light, and changing the
region specification line from a solid line to a broken line. 
To select a region, the user need only stop dragging the 
specification point at a position of his own choice. 

Next, the user decides whether or not to make the region
he or she has selected into a divided document. When he or she
decides not to do so, this decision is shown clearly by the
document divider 5002 shading the selected region with a
hatch pattern on the screen.

In this way, according to the present embodiment, since 
the user can select divided document data from document data
as he or she wishes, he or she can learn the detailed 
classification structure of the document data. In addition, 
the user can classify documents as he or she wishes. 

According to the fourteenth embodiment, in the document 
classification device and document classification method
according to the seventh to the tenth embodiments, document data
is divided based on the number of letters, the number of 
sentences, or both the number of letters and the number of 
sentences. For example, the document data shown in Fig. 54 is 
divided into units of approximately two hundred letters. 

Here, the units each comprise approximately two hundred 
letters, since there is no guarantee that a unit of exactly two 
hundred letters will end with a full stop. Therefore, the 
nearest full stop before or after the two hundredth letter is 
deemed to be the end of the divided document. In this way, the 
divided document of Fig. 54 is created. Similarly, documents 
can be divided into units comprising a predetermined number of 
sentences, and documents can be divided based on both the number 
of letters and the number of sentences. 
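
The following Python sketch illustrates one possible reading of this rule: each unit is cut at whichever full stop lies nearest to the target letter count. The function name `divide_by_length` and its parameters are hypothetical, and treating "." as the only sentence terminator is a simplifying assumption.

```python
def divide_by_length(text, unit=200, stop="."):
    """Cut text into units of roughly `unit` letters, each ending at a full stop."""
    pieces, start = [], 0
    while len(text) - start > unit:
        target = start + unit                     # position just past the unit-th letter
        before = text.rfind(stop, start, target)  # last full stop inside the unit
        after = text.find(stop, target)           # first full stop beyond the unit
        candidates = [p for p in (before, after) if p != -1]
        if not candidates:
            break                                 # no full stop left: keep the rest whole
        # Choose the full stop nearest to the target position.
        cut = min(candidates, key=lambda p: abs(p - target)) + 1
        pieces.append(text[start:cut].strip())
        start = cut
    tail = text[start:].strip()
    if tail:
        pieces.append(tail)
    return pieces

units = divide_by_length("aaaa. bbbbbbbb. cc.", unit=6)
```

A division by number of sentences could follow the same pattern by counting full stops instead of letters.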

Consequently, according to the fourteenth embodiment, 
since documents can be divided based on the number of letters, 
the number of sentences, or both the number of letters and the 
number of sentences, there is an increased capability to 
classify different documents having contents of different 
topics and the like. Therefore, as above, documents can be 
classified so that the detailed classification structure of the 
document data can be known. 

According to the fifteenth embodiment, in the document 
classification device and document classification method 
according to the previous embodiments, the document
classification result creator 5006 specifies only information
representing document data, and representative information 
accompanying the document data, as classification result 
information. 

As shown for example in Fig. 55, the classification 
categories are displayed at the head, key words representing 
the categories are displayed next to the classification 
categories, and, for example, the document data name (document 
name) of the document data contained in the divided document 
data belonging to the categories is displayed below the category 
name, as information representing the document data. 
Furthermore, document icons are displayed on the left of the 
document data names. When these document icons are specified, 
the contents of the document data are displayed. 

Furthermore, document data names of divided document data 
having a high degree of similarity to the category 
representative value are arranged at the head (left side) of 
the list of document data names. Furthermore, when multiple 
divided document data created from the same document data belong 
to the same classification category, only a document data name 
corresponding to the divided document data having the highest 
degree of similarity is displayed. The key words are words 
which appear frequently. 
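
The ordering and de-duplication rules above can be sketched as follows. This is a minimal illustration, assuming each divided document arrives as a (category, document name, similarity) tuple; the name `build_listing` and this input layout are not taken from the embodiment.

```python
from collections import defaultdict

def build_listing(rows):
    """rows: (category, document_name, similarity) for each divided document."""
    best = defaultdict(dict)  # category -> {document_name: highest similarity}
    for category, name, sim in rows:
        # Keep only the most similar divided document of each source document.
        if sim > best[category].get(name, float("-inf")):
            best[category][name] = sim
    # Within each category, list names from most to least similar.
    return {
        category: sorted(names, key=names.get, reverse=True)
        for category, names in best.items()
    }

rows = [
    ("economy", "doc_a", 0.62),
    ("economy", "doc_b", 0.91),
    ("economy", "doc_a", 0.95),  # a second divided document from doc_a
    ("sports", "doc_a", 0.40),
]
listing = build_listing(rows)
```

Here "doc_a" appears once under "economy" even though two of its divided documents belong to that category.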

Thus, according to the fifteenth embodiment, since only 
information representing document data, and representative
information accompanying the document data, are displayed as
the classification result information, the user can easily 
comprehend the overall classification structure of the document 
data in detail. 

According to the sixteenth embodiment of the present 
invention, in addition to specifying the document 
classification result as in the fifteenth embodiment, 
information representing divided document data and information 
accompanying the divided document data are also displayed. 

As shown for example in Fig. 56, the classification
categories are displayed at the head, key words representing 
the categories are displayed next to the classification 
categories, and, for example, the document data name (document 
name) of the document data contained in the divided document 
data belonging to the categories is displayed below the category 
name, as information representing the document data. 

Furthermore, document icons are displayed on the left of 
the document data names. When the document icons are specified,
the contents of the document data are displayed. Moreover, 
divided document icons are displayed on the right of the 
document data names. The position of divided document data in 
the document data, and the number of divided documents in the 
document data, are displayed in the divided document icons. The
divided document data in the document data can be displayed by 
specifying a divided document icon. 



Furthermore, document data names of divided document data 
having a high degree of similarity to the category 
representative value are arranged at the head of the list of 
document data names. Furthermore, when multiple divided 
document data created from the same document data belong to the
same classification category, only a document data name 
corresponding to the divided document data having the highest 
degree of similarity is displayed. 

Thus, according to the sixteenth embodiment, since only
information representing document data, representative
information accompanying the document data, information
representing divided document data, and representative
information accompanying the divided document data are
displayed as the classification result information, the user
can easily comprehend the overall classification structure of
the document data in detail, and can easily comprehend which
document data has been classified in which category, and the
like.

The document classification device and document
classification method of the present invention have been
explained above. Programs for executing the document
classification method can be recorded on a detachable and
computer-readable recording medium, and the document
classification according to the present invention can be
carried out by using the recording medium in the above-mentioned
data processing device.

As described above, according to one aspect of this
invention, the document processor of the present invention
comprises a document memory for storing input document data;
a selection unit for selecting all or part of document data
stored in the document memory; a characteristics extraction
unit for extracting data relating to characteristics of letter
rows from all or part of the document data selected by the
selection unit; a work processing unit for work-processing all
or part of the document data based on the data relating to
characteristics of letter rows extracted by the characteristics
extraction unit; and an output unit for outputting all or part
of the document data work-processed by the work processing unit.
Consequently, when analyzing documents according to their
meanings, rather than merely outputting the result of the
analysis, the entire information analysis operation can be
supported.

Further, the output unit comprises an item value set unit
for setting a plurality of item values based on the contents
of all or part of the document data work-processed by the
work processing unit; and a totalization unit for totalizing
all or part of the document data for each item value set by the
item value set unit. Furthermore, the output unit outputs all
or part of the document data in the format of a table having
an item value as at least one axis. Consequently, the result
of the work-processing can easily be expressed in a cross table,
and the contents of the information can easily be grasped.
Therefore, when analyzing documents according to their meanings,
rather than merely outputting the result of the analysis, the
entire information analysis operation can be supported.
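
By way of a hedged illustration, totalizing document data per item value into a cross table could look like the Python sketch below. The record layout, the field names "product" and "issue", and the function `cross_table` are hypothetical and serve only to show counting along two item-value axes.

```python
from collections import Counter

def cross_table(records, row_key, col_key):
    """Count records for each pair of item values and lay them out as a table."""
    counts = Counter((r[row_key], r[col_key]) for r in records)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    # Pairs that never occur are reported as 0.
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

records = [
    {"product": "printer", "issue": "noise"},
    {"product": "printer", "issue": "jam"},
    {"product": "copier", "issue": "jam"},
    {"product": "printer", "issue": "jam"},
]
row_values, col_values, table = cross_table(records, "product", "issue")
```

Each cell of `table` holds the number of records sharing the row item value and the column item value.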

Further, the output unit outputs all or part of the 
document data work-processed by the work processing unit 
together with all or part of the document data in its state prior 
to work-processing by the work processing unit. Consequently,
data to be work-processed and other data can be displayed
simultaneously and identified, whereby the range of the 
work-processing to be carried out can be accurately and easily 
determined. Therefore, when analyzing documents according to 
their meanings, rather than merely outputting the result of the
analysis, the entire information analysis operation can be
supported. 

Further, the document memory also stores all or part of 
the document data work-processed by the work processing unit. 
Consequently, since other data can be handled simultaneously, 
when thereafter analyzing documents according to their meanings,
rather than merely outputting the result of the analysis, the 
entire information analysis operation can be supported. 

Further, the selection unit further selects all or part 
of the document data output by the output unit. Consequently, 
since all or part of the document data output by the output unit
can be selected for analysis, a wide variety of information can
be analyzed with high precision. Therefore, when analyzing 
documents according to their meanings, rather than merely 
outputting the result of the analysis, the entire information 
analysis operation can be supported. 

Further, the document memory further stores data relating 
to contents of the work processing. Consequently, not only can
loss of data relating to the contents of work-processing
be prevented and the data managed easily, but also the
relationship between settings used in the work-processing and 
the processed result can be determined. Therefore, when 
analyzing documents according to their meanings, rather than 
merely outputting the result of the analysis, the entire 
information analysis operation can be supported. 

According to another aspect of this invention, the
document classification device according to the present
invention comprises an input unit for inputting document data;
a language analyzer unit for analyzing document data input by
the input unit and obtaining language analysis information; a
vector creation unit for creating document characteristic
vectors for the document data based on the language analysis
information obtained by the language analyzer unit; a
classification unit for classifying documents based on the
degree of similarity between document characteristic vectors
created by the vector creation unit, and creating clusters of
documents; a cluster characteristics calculation unit for
calculating cluster characteristics, which are characteristics
of clusters of documents created by the classification unit;
and a classification category memory for storing cluster
characteristics, calculated by the cluster characteristics
calculation unit, as constituent elements of classification
categories. Consequently, it is possible to obtain clusters,
and to structure and categorize the clusters based on their
contents using their degree of similarity to the cluster center,
and the like. Therefore, it is possible to gradually determine
what kind of contents are contained in a given document cluster.
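
The chain of units described above can be illustrated with the following Python sketch. It is only a sketch under stated assumptions: word counts stand in for the language analysis, cosine similarity is assumed as the degree of similarity, and a greedy single-pass grouping with a fixed threshold is assumed as the classification; none of these choices is specified in this passage, and all function names are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Stand-in for language analysis: lowercase word counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(vectors, threshold=0.3):
    """Greedy clustering: join the most similar cluster or start a new one."""
    clusters = []  # each cluster: {"members": [indices], "centroid": Counter}
    for i, vec in enumerate(vectors):
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(i)
            best["centroid"] += vec  # summed vector; cosine ignores its scale
        else:
            clusters.append({"members": [i], "centroid": Counter(vec)})
    return clusters

def cluster_characteristics(cluster, top=3):
    """Most frequent terms of the cluster, usable as a category description."""
    return [term for term, _ in cluster["centroid"].most_common(top)]

docs = [
    "stock market prices fall",
    "stock prices rise",
    "team wins football match",
    "football team loses",
]
clusters = classify([vectorize(d) for d in docs])
categories = [cluster_characteristics(c, top=2) for c in clusters]
```

The resulting `categories` list plays the role of the cluster characteristics stored as constituent elements of classification categories.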

According to another aspect of this invention, the
document classification device comprises an input unit for
inputting document data; a language analyzer unit for analyzing
document data input by the input unit and obtaining language
analysis information; a vector creation unit for creating
document characteristic vectors for the document data based on
the language analysis information obtained by the language
analyzer unit; a classification unit for classifying documents
based on the degree of similarity between document
characteristic vectors created by the vector creation unit, and
creating clusters of documents; a cluster characteristics
calculation unit for calculating cluster characteristics,
which are characteristics of clusters of documents created by
the classification unit; a display unit for displaying the
cluster characteristics calculated by the cluster
characteristics calculation unit; a cluster selection
specification unit for selecting predetermined clusters from
clusters of documents created by the classification unit; and
a classification category memory for storing cluster
characteristics, calculated by the cluster characteristics
calculation unit, as constituent elements of classification
categories. Consequently, only selected clusters are used,
making it possible to structure and categorize the clusters in
a manner closer to that desired by the operator. Therefore,
it is possible to gradually determine what kind of contents are
contained in a given document cluster.

Further, the document classification device of the
present invention described above further comprises a document
characteristic vector memory for storing document
characteristic vectors created by the vector creation unit; and
a vector correction unit for correcting document characteristic
vectors stored in the document characteristic vector memory,
so that document characteristic vectors of documents belonging
to clusters selected by the cluster selection unit are deleted.
Furthermore, the classification unit classifies documents
based on the document characteristic vectors corrected by the
vector correction unit. Consequently, the effects of clusters
which are already known can be eliminated, and new clusters can
be created. Therefore, it is possible to gradually determine
what kind of contents are contained in a given document cluster.
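
A minimal Python sketch of this vector correction is given below, assuming the stored vectors and cluster memberships are held in dictionaries. The function name `correct_vectors` and the data layout are hypothetical.

```python
def correct_vectors(vectors, clusters, selected):
    """Drop vectors of documents that belong to the selected clusters.

    vectors:  {doc_id: vector}
    clusters: {cluster_id: [doc_id, ...]}
    selected: ids of clusters the operator has already selected
    """
    known = {doc for cid in selected for doc in clusters[cid]}
    return {doc: vec for doc, vec in vectors.items() if doc not in known}

vectors = {"d1": [1, 0], "d2": [0.9, 0.1], "d3": [0, 1], "d4": [0.2, 0.8]}
clusters = {"c1": ["d1", "d2"], "c2": ["d3", "d4"]}
remaining = correct_vectors(vectors, clusters, selected=["c1"])
```

Running the classification again on `remaining` would then group only the documents whose clusters are not yet known.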

Further, the document classification device of the
present invention described above further comprises a document
characteristic vector memory for storing document
characteristic vectors created by the vector creation unit; and
a document expression space correction unit for correcting
document expression space when determining the degree of
similarity between document characteristic vectors stored in
the document characteristic vector memory, based on a
characteristics amount calculated from clusters selected by the
cluster selection unit. Furthermore, the classification unit
classifies documents based on the degree of similarity between
document characteristic vectors created by the vector creation
unit, using the document expression space corrected by the
document expression space correction unit. Consequently,
cluster characteristics selected by the operator in the
previous classification can be eliminated from the next
classification, enabling new clusters to be created.
Therefore, it is possible to gradually determine what kind of
contents are contained in a given document cluster.
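
This passage does not state how the document expression space is corrected. One possible interpretation, shown purely as an assumption, is to take the centroid of a selected cluster as the characteristics amount and to measure similarity after projecting that direction out of every vector. The function names are hypothetical, and the centroid is assumed to be non-zero.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def remove_direction(vec, direction):
    """Project `vec` onto the subspace orthogonal to `direction` (non-zero)."""
    scale = dot(vec, direction) / dot(direction, direction)
    return [a - scale * b for a, b in zip(vec, direction)]

def corrected_similarity(u, v, centroid):
    """Cosine similarity after discounting the selected cluster's centroid."""
    u2 = remove_direction(u, centroid)
    v2 = remove_direction(v, centroid)
    norm = math.sqrt(dot(u2, u2)) * math.sqrt(dot(v2, v2))
    return dot(u2, v2) / norm if norm > 1e-12 else 0.0

centroid = [1.0, 0.0, 0.0]  # characteristics of a cluster already selected
same = corrected_similarity([3.0, 1.0, 0.0], [5.0, 1.0, 0.0], centroid)
different = corrected_similarity([3.0, 1.0, 0.0], [4.0, 0.0, 1.0], centroid)
```

Under this assumption, two documents that resemble each other only through the already selected cluster no longer count as similar in the next classification.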

Further, the document classification device of the
present invention described above further comprises a document
characteristic vector memory for storing document
characteristic vectors created by the vector creation unit; and
a document expression space correction unit for correcting
document expression space when determining the degree of
similarity between document characteristic vectors stored in
the document characteristic vector memory, based on a
characteristics amount calculated from clusters selected by the
cluster selection unit. Furthermore, the classification unit
classifies documents based on the degree of similarity between
document characteristic vectors created by the vector creation
unit, using the document expression space corrected by the
document expression space correction unit. Consequently,
influences of the known cluster can be eliminated and cluster
characteristics selected by the operator in the previous
classification can be eliminated from the next classification,
enabling new clusters to be created. Therefore, it is possible
to gradually determine what kind of contents are contained in
a given document cluster.

Further, the document classification device of the
present invention described above further comprises a
selection information appending unit for appending selection
information showing the fact of selection when all or part of
the documents belonging to a cluster of documents created by
the classification unit have been selected. Furthermore, the
display unit displays the cluster characteristics, and also
displays the selection information appended by the selection
information appending unit. Consequently, it is possible to
improve the ability to identify documents used on multiple
occasions, and the ability to identify documents which have not
been selected at all. Therefore, it is possible to gradually
determine what kind of contents are contained in a given
document cluster.

Further, the classification category memory stores
cluster characteristics and/or information created by an
operator, in addition to all or part of the documents belonging
to a cluster of documents selected by the selection
specification unit, as constituent elements of classification
categories. Consequently, the contents of clusters can be
easily recognized, and in addition, the operator can easily
create his own classification categories, thereby improving the
usefulness of the classification categories. Therefore, it is
possible to gradually determine what kind of contents are
contained in a given document cluster.

According to still another aspect of this invention, the
document classification device of the present invention, which
classifies document clusters in accordance with the contents
thereof, comprises a document input unit for inputting
document data groups; a document dividing unit for dividing
document data into one or multiple divided document data based
on a predetermined reference; a document-divided document map
creation unit for creating a map showing the correspondence
between the document data and the divided document data; a
divided document classification unit for classifying the
divided document data; a divided document classification result
creation unit for creating divided document classification
result information based on a classification result of the
divided document classification unit; and a document
classification result creation unit for creating
classification result information of the above document data
using the document-divided document map and the divided
document classification result information. Consequently,
when one document contains multiple topics and meanings, these
can be classified into categories according to specific topics
and meanings, so that the classifications do not differ from
categories desired by a user, thereby enabling the user to
easily comprehend the classification categories. Furthermore,
since the positions of the divided documents in documents prior
to division (documents belonging to the clusters) are displayed,
the user is able to efficiently read the parts of the document
clusters he or she wishes to read.
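
The role of the document-divided document map can be sketched in Python as follows. The identifier scheme "name#position" and the functions `build_map` and `document_results` are assumptions made for this illustration only.

```python
from collections import defaultdict

def build_map(divided):
    """divided: {document_name: [divided texts]}.

    Returns {divided_id: (document_name, position in the document)}.
    """
    return {
        f"{name}#{pos}": (name, pos)
        for name, parts in divided.items()
        for pos, _ in enumerate(parts)
    }

def document_results(doc_map, divided_categories):
    """Lift per-divided-document categories to the source documents.

    Returns {category: {document_name: [positions of divided documents]}}.
    """
    result = defaultdict(lambda: defaultdict(list))
    for divided_id, category in divided_categories.items():
        name, pos = doc_map[divided_id]
        result[category][name].append(pos)
    return {cat: dict(docs) for cat, docs in result.items()}

divided = {"news": ["rates", "phones", "bonds"]}
doc_map = build_map(divided)
divided_categories = {
    "news#0": "economy",
    "news#1": "technology",
    "news#2": "economy",
}
results = document_results(doc_map, divided_categories)
```

Because the map keeps the position of each divided document, the result can show both which category a document falls into and where in the document the relevant part lies.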

Further, the document classification device of the
present invention described above further comprises a document
save unit for saving the document data; a divided document save
unit for saving the divided document data; and a
document-divided document map save unit for saving a
document-divided document map created by the document-divided
document map creation unit. Consequently, for a single document
data, it is possible to efficiently determine classification
results having different parameters such as the number of
classifications, the classification method, and the settings
used in the classifications, without recreating the divided
document data and the document-divided document map.
Furthermore, by classifying the document data and saving the
data needed to create the classification result, the user is
free to take more time over the classification, and to
re-analyze previously classified documents within a given period
of time.

Further, the document classification device of the
present invention described above further comprises a divided
document classification result save unit for saving divided
document classification result information created by the
divided document classification result creation unit.
Consequently, in addition to the effects achieved by the
specific arrangement of the present invention described above,
after one classification has been carried out, the result of
that classification can be expressed in a variety of formats
such as text, charts, graphs, and the like. Furthermore, by
saving the divided document classification result information,
the user is free to take more time over classifications and
analysis of classification results, and to re-analyze
previously classified documents in a variety of formats within
a given period of time.

Further, the multiple divided document data created by
the document dividing unit contains the document data in its
state prior to being divided. Consequently, in addition to a
classification structure of detailed document data, obtained
by classifying the divided document data, the user is able to
obtain a classification structure fusing a schematic macro
classification as a result of classifying the document data
itself prior to division.

Further, the document dividing unit divides document data 
based on information relating to the structure of the document 
data. Consequently, division and the like of different topics 
can be carried out, whereby documents can be classified in such 
a manner that the detailed classification structures of their 
document data can be known. 

Further, the document classification device further
comprises a document element extraction unit for extracting
elements in the document data; and an element-accompanying
information extraction unit for extracting
element-accompanying information accompanying the elements
extracted by the document element extraction unit. Furthermore, the
document dividing unit divides the document data using elements 
extracted by the document element extraction unit, or the 
elements and element-accompanying information extracted by the 
element-accompanying information extraction unit. 
Consequently, documents can be classified so that the detailed 
classification structure of the document data can be known. 

Further, the document dividing unit divides document data
in compliance with a specified specification range.
Consequently, documents can be classified in accordance with
the wishes of the user, and so that the detailed classification
structure of the document data can be known.

Further, the document dividing unit divides document data
based on the number of letters, the number of sentences, or both
the number of letters and the number of sentences. Consequently,
there is an increased capability to classify different
documents having contents of different topics and the like.
Therefore, as above, documents can be classified so that the
detailed classification structure of the document data can be
known.

Further, the document classification result creation 
unit extracts and presents information showing document data,
and representative information accompanying the document data, 
as classification result information. Consequently, the user 
is able to determine a detailed schematic structure or overall 
structure of the document data. 

Further, the document classification result creation
unit extracts and presents information showing divided document
data, and representative information accompanying the divided
document data, as classification result information.
Consequently, the user is able to determine a detailed schematic
structure or overall structure of the document data. In
addition, the user can easily determine which divided document
has been classified in a given category.

According to still another aspect of this invention, the 
document processing method of the present invention outputs 
multiple input document data in order to display or print the 
document data in a predetermined format, and comprises the steps 
of storing input document data; selecting all or part of the 
document data stored in the storing step; extracting data 
relating to characteristics of letter rows from all or part of 
the document data selected by the selection step; work- 
processing all or part of the document data based on the data 
relating to characteristics of letter rows extracted in the 
characteristics extraction step; and outputting all or part of 
the document data work-processed in the work processing step. 
Consequently, when analyzing documents according to their 
meanings, rather than merely outputting the result of the 
analysis, the entire information analysis operation can be
supported.

Further, the step of outputting comprises the steps of 
setting a plurality of item values based on the contents of all 
or part of the document data work-processed in the work- 
processing step; and totalizing all or part of the document data 
for each item value set in the item value set step; and outputs 
all or part of the document data in the format of a table having 
an item value as at least one axis. Consequently, the result
of the work-processing can easily be expressed in a cross table,
and the contents of the information can easily be grasped.
Therefore, when analyzing documents according to their meanings,
rather than merely outputting the result of the analysis, the
entire information analysis operation can be supported.

Further, the step of outputting further comprises 
outputting all or part of the document data work-processed in 
the work processing step together with all or part of the 
document data in its state prior to work-processing in the work
processing step. Consequently, data to be work-processed and
other data can be displayed simultaneously and identified, 
whereby the range of the work-processing to be carried out can 
be accurately and easily determined. Therefore, when 
analyzing documents according to their meanings, rather than
merely outputting the result of the analysis, the entire
information analysis operation can be supported. 

Further, the step of storing further comprises storing 
all or part of the document data work-processed in the work 
processing step. Consequently, since other data can be handled
simultaneously, when thereafter analyzing documents according
to their meanings, rather than merely outputting the result of
the analysis, the entire information analysis operation can be
supported.

Further, the step of selecting further comprises
selecting all or part of the document data output in the output
step. Consequently, since all or part of the document data
output in the output step can be selected for analysis, a wide
variety of information can be analyzed with high precision.
Therefore, when analyzing documents according to their meanings,
rather than merely outputting the result of the analysis, the
entire information analysis operation can be supported.

Further, the step of storing a document further comprises
storing data relating to contents of the work processing.
Consequently, not only can loss of data relating to the contents
of work-processing be prevented and the data managed easily,
but also the relationship between settings used in the
work-processing and the processed result can be determined.
Therefore, when analyzing documents according to their meanings,
rather than merely outputting the result of the analysis, the
entire information analysis operation can be supported.

According to still another aspect of this invention, the 
document classification method of the present invention 
comprises the steps of inputting document data;
language-analyzing document data input in the step of inputting
and obtaining language analysis information; creating document
characteristic vectors for the document data based on the
language analysis information obtained in the step of
language-analyzing; classifying documents based on the degree
of similarity between document characteristic vectors created
in the step of creating vectors, and creating clusters of
documents; calculating cluster characteristics, being
characteristics of clusters of documents created in the step 
of classifying; and storing cluster characteristics, 
calculated in the step of calculating cluster characteristics, 
as constituent elements of classification categories. 
Consequently, it is possible to obtain clusters, and to 
structure and categorize the clusters based on their contents 
using their degree of similarity to the cluster center, and the 
like. Therefore, it is possible to gradually determine what 
kind of contents are contained in a given document cluster. 

According to still another aspect of this invention, the 
document classification method of the present invention 
comprises the steps of inputting document data; language- 
analyzing document data input in the step of inputting and 
obtaining language analysis information; creating document 
characteristic vectors for the document data based on the 
language analysis information obtained in the step of 
language-analyzing; classifying documents based on the degree 
of similarity between document characteristic vectors created 
in the step of creating vectors, and creating clusters of 
documents; calculating cluster characteristics, which are 
characteristics of clusters of documents created in the step 
of classifying; displaying the cluster characteristics 
calculated in the step of calculating cluster characteristics; 
selecting predetermined clusters from clusters of documents
created in the step of classifying; and storing cluster
characteristics, calculated in the step of calculating cluster
characteristics, as constituent elements of classification
categories. Consequently, only selected clusters are used,
making it possible to structure and categorize the clusters in
a manner closer to that desired by the operator. Therefore, 
it is possible to gradually determine what kind of contents are 
contained in a given document cluster. 

Further, the document classification method of the
present invention described above further comprises a step of
correcting document characteristic vectors stored in the step
of storing document characteristic vectors, so that document
characteristic vectors of documents belonging to clusters
selected by the step of selecting clusters are deleted.
Furthermore, the step of classifying comprises classifying
documents based on the document characteristic vectors
corrected by the step of correcting vectors. Consequently, the
effects of clusters which are already known can be eliminated,
and new clusters can be created. Therefore, it is possible to
gradually determine what kind of contents are contained in a
given document cluster.
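
The vector correction described above can be sketched as
follows. The dictionary layout and the function name
correct_vectors are hypothetical and used only for
illustration; the sketch shows the deletion of vectors
belonging to selected clusters, after which the remaining
vectors would be classified again.

```python
def correct_vectors(vectors, clusters, selected):
    """Return the stored vectors without those of documents in selected clusters.

    vectors:  dict mapping document id -> characteristic vector
    clusters: dict mapping cluster id -> list of document ids
    selected: cluster ids chosen by the operator (already-known clusters)
    """
    known = {doc for cid in selected for doc in clusters[cid]}
    return {doc: vec for doc, vec in vectors.items() if doc not in known}

vectors = {
    "d1": [1.0, 0.0],
    "d2": [0.9, 0.1],
    "d3": [0.0, 1.0],
    "d4": [0.1, 0.8],
}
clusters = {"A": ["d1", "d2"], "B": ["d3", "d4"]}

# Cluster "A" is already known, so only d3 and d4 are classified next time.
remaining = correct_vectors(vectors, clusters, selected=["A"])
```

The original vector store is left untouched in this sketch, so
an earlier classification can still be reproduced.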

Further, the document classification method of the
present invention described above further comprises a step of
correcting document expression space when determining the
degree of similarity between document characteristic vectors
stored in the step of storing document characteristic vectors,
based on a characteristics amount calculated from clusters
selected in the step of selecting clusters, and the step of
classifying comprises classifying documents based on the degree
of similarity between document characteristic vectors created
in the step of creating vectors, using the document expression
space corrected in the step of correcting the document
expression space. Consequently, cluster characteristics
selected by the operator in the previous classification can be
eliminated from the next classification, enabling new clusters
to be created. Therefore, it is possible to gradually determine
what kind of contents are contained in a given document cluster.
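
One possible reading of this correction, given purely as an
assumption, is to take the centroid of the selected cluster as
the characteristics amount and to remove its direction from
every document vector before similarity is measured. The
sketch below illustrates that reading only; the passage above
does not fix the specific correction, and all names are
hypothetical.

```python
import math

def centroid(vectors):
    """Characteristics amount of a cluster: the mean vector (assumption)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def project_out(v, direction):
    """Remove from v its component along direction (orthogonal projection)."""
    norm2 = sum(x * x for x in direction)
    if norm2 == 0:
        return list(v)
    scale = sum(a * b for a, b in zip(v, direction)) / norm2
    return [a - scale * b for a, b in zip(v, direction)]

def cosine(a, b):
    """Degree of similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Vectors of the cluster selected in the previous classification.
selected_cluster = [[2.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
known = centroid(selected_cluster)

doc_a = [3.0, 1.0, 0.0]
doc_b = [3.0, 0.0, 1.0]

before = cosine(doc_a, doc_b)                      # dominated by the known topic
after = cosine(project_out(doc_a, known),
               project_out(doc_b, known))          # known topic removed
```

Before correction the two documents look alike because both
share the already-known topic; after correction only their
remaining differences drive the similarity.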

Further, the document classification method of the
present invention described above further comprises a step
of correcting document expression space when determining the
degree of similarity between document characteristic vectors
stored in the step of storing document characteristic vectors,
based on a characteristics amount calculated from clusters
selected in the step of selecting clusters. Furthermore, the
step of classifying comprises classifying documents based on
the degree of similarity between document characteristic
vectors created in the step of creating vectors, using the
document expression space corrected in the step of correcting
the document expression space. Consequently, influences of
the known cluster can be eliminated and cluster characteristics
selected by the operator in the previous classification can be
eliminated from the next classification, enabling new clusters
to be created. Therefore, it is possible to gradually determine
what kind of contents are contained in a given document cluster.

Further, the document classification method of the
present invention described above further comprises a step
of appending selection information showing the fact of
selection when all or part of the documents belonging to a
cluster of documents created in the step of classifying have
been selected. Furthermore, the step of displaying comprises
displaying the cluster characteristics, and displaying the
selection information appended in the step of appending
selection information. Consequently, it is possible to
improve the ability to identify documents used on multiple
occasions, and the ability to identify documents which have not
been selected at all. Therefore, it is possible to gradually
determine what kind of contents are contained in a given
document cluster.

Further, the step of creating classification categories
comprises creating cluster characteristics and/or information
created by an operator, in addition to all or part of the
documents belonging to a cluster of documents selected in the
step of specifying selection, as constituent elements of
classification categories. Consequently, the contents of
clusters can be easily recognized, and in addition, the operator
can easily create his or her own classification categories, thereby
improving the usefulness of the classification categories.
Therefore, it is possible to gradually determine what kind of
contents are contained in a given document cluster.

According to still another aspect of this invention, the
document classification method according to the present
invention comprises the steps of inputting document data
groups; dividing document data into one or multiple divided
document data based on a predetermined reference; creating a
map showing the correspondence between the document data and
the divided document data; classifying the divided document
data; creating divided document classification result
information based on the classification result of classifying
the divided documents; and creating classification result
information of the document data using the document-divided
document map and the divided document classification result
information. Consequently, when one document contains
multiple topics and meanings, these can be classified into
categories according to specific topics and meanings, so that
the classifications do not differ from categories desired by
a user, thereby enabling the user to easily comprehend the
classification categories. Furthermore, since the positions
of the divided documents in documents prior to division
(documents belonging to the clusters) are displayed, the user
is able to efficiently read the parts of the document clusters
he or she wishes to read.
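
The division, mapping, and aggregation steps above can be
sketched as follows. Splitting on blank lines is only one
possible predetermined reference, and the keyword-based
classify_part function is a hypothetical stand-in for the
actual classification of divided documents; both are
assumptions made for illustration.

```python
def divide(doc_id, text):
    """Divide a document on blank lines (one possible predetermined reference)."""
    parts = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [(f"{doc_id}#{i}", part) for i, part in enumerate(parts)]

def classify_part(text):
    """Hypothetical keyword classifier standing in for the real classification."""
    lowered = text.lower()
    if "printer" in lowered:
        return "printing"
    if "battery" in lowered:
        return "power"
    return "other"

def classify_documents(docs):
    part_to_doc = {}   # document / divided-document map
    part_result = {}   # divided document classification result information
    for doc_id, text in docs.items():
        for part_id, part in divide(doc_id, text):
            part_to_doc[part_id] = doc_id
            part_result[part_id] = classify_part(part)

    # Classification result of the original documents, built from the map:
    # for each document, which categories it touches and in which parts.
    doc_result = {}
    for part_id, category in part_result.items():
        doc_id = part_to_doc[part_id]
        doc_result.setdefault(doc_id, {}).setdefault(category, []).append(part_id)
    return part_to_doc, part_result, doc_result

docs = {
    "doc1": "The printer jams.\n\nThe battery drains fast.",
    "doc2": "Battery swelling reported.",
}
part_to_doc, part_result, doc_result = classify_documents(docs)
```

Because each category keeps the identifiers of the parts that
fell into it, the position of a divided document within its
original document can be shown to the user.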

According to still another aspect of this invention, a
computer-readable recording medium of the present invention
stores programs for executing the above-described document
classification method on a computer, thereby making the program
readable mechanically, and enabling the operation of the
document classification method to be executed by a computer.

Although the invention has been described with respect
to a specific embodiment for a complete and clear disclosure,
the appended claims are not to be thus limited but are to be
construed as embodying all modifications and alternative
constructions that may occur to one skilled in the art which
fairly fall within the basic teaching herein set forth.

WHAT IS CLAIMED IS:

1. A document processor which displays and prints in a
predetermined format a plurality of document data input thereto,
comprising:

document memory which stores input document data;

selection unit which selects all or part of document data
stored in said document memory;

characteristics extraction unit which extracts data
relating to characteristics of letter rows from all or part of
the document data selected by said selection unit;

work processing unit which work-processes all or part of
the document data based on the data relating to characteristics
of letter rows extracted by said characteristics extraction
unit; and

output unit which outputs all or part of the document data
work-processed by said work processing unit.

2. The document processor according to claim 1, wherein said
output unit comprises item value set unit which sets a plurality
of item values based on the contents of all or part of the
document data work-processed by said work processing unit; and
totalization unit which totalizes all or part of the document
data for each item value set by said item value set unit;

said output unit outputs all or part of the document data
in the format of a table having an item value as at least one
axis.



3. The document processor according to claim 1, wherein said
output unit outputs all or part of the document data
work-processed by said work processing unit together with all or part
of the document data in its state prior to work-processing by
said work processing unit.

4. The document processor according to claim 1, wherein said
document memory further stores all or part of the document data
work-processed by said work processing unit.

5. The document processor according to claim 1, wherein said
selection unit further selects all or part of the document data
output by said output unit.

6. The document processor according to claim 1, wherein said 
document memory further stores data relating to contents of the 
work processing. 

7. A document classification device which classifies
documents based on contents thereof comprising:

input unit which inputs document data;

language analyzer unit which analyzes document data input
by said input unit and obtains language analysis information;

vector creation unit which creates document
characteristic vectors for the document data based on the
language analysis information obtained by said language
analyzer unit;

classification unit which classifies documents based on
the degree of similarity between document characteristic
vectors created by said vector creation unit, and creates
clusters of documents;

cluster characteristics calculation unit which
calculates cluster characteristics, which are characteristics
of clusters of documents created by said classification unit;
and

classification category memory which stores cluster
characteristics, calculated by said cluster characteristics
calculation unit, as constituent elements of classification
categories.

8. A document classification device which classifies
documents based on contents thereof comprising:

input unit which inputs document data;

language analyzer unit which analyzes document data input
by said input unit and obtains language analysis information;

vector creation unit which creates document
characteristic vectors for the document data based on the
language analysis information obtained by said language
analyzer unit;

classification unit which classifies documents based on
the degree of similarity between document characteristic
vectors created by said vector creation unit, and creates
clusters of documents;

cluster characteristics calculation unit which
calculates cluster characteristics, which are characteristics
of clusters of documents created by said classification unit;

display unit which displays the cluster characteristics
calculated by said cluster characteristics calculation unit;

cluster selection specification unit which selects
predetermined clusters from the clusters of documents created by
said classification unit; and

classification category memory which stores cluster
characteristics, calculated by said cluster characteristics
calculation unit, as constituent elements of classification
categories.

9. The document classification device according to claim 8,
further comprising document characteristic vector memory which
stores document characteristic vectors created by said vector
creation unit; and

vector correction unit which corrects document
characteristic vectors stored in said document characteristic
vector memory, so that document characteristic vectors of
documents belonging to clusters selected by said cluster
selection unit are deleted;

said classification unit classifies documents
based on the document characteristic vectors corrected by said
vector correction unit.

10. The document classification device according to claim 8,
further comprising document characteristic vector memory which
stores document characteristic vectors created by said vector
creation unit; and

document expression space correction unit which corrects
document expression space when determining the degree of
similarity between document characteristic vectors stored in
said document characteristic vector memory, based on a
characteristics amount calculated from clusters selected by
said cluster selection unit;

said classification unit classifies the documents based on
the degree of similarity between document characteristic
vectors created by said vector creation unit, using the document
expression space corrected by said document expression space
correction unit.

11. The document classification device according to claim 9,
further comprising document characteristic vector memory which
stores document characteristic vectors created by said vector
creation unit; and

document expression space correction unit which corrects
the document expression space when determining the degree of
similarity between document characteristic vectors stored in
said document characteristic vector memory, based on a
characteristics amount calculated from clusters selected by
said cluster selection unit;

said classification unit classifies the documents based on
the degree of similarity between document characteristic
vectors created by said vector creation unit, using the document
expression space corrected by said document expression space
correction unit.



12. The document classification device according to claim 8,
further comprising selection information appending unit which
appends selection information showing the fact of selection
when all or part of the documents belonging to a cluster of
documents created by said classification unit have been
selected;

said display unit displays the cluster characteristics,
and also displays the selection information appended by said
selection information appending unit.



13. The document classification device according to claim 8,
wherein said classification category memory stores cluster
characteristics and/or information created by an operator, in
addition to all or part of the documents belonging to a cluster
of documents selected by said selection specification unit, as
constituent elements of classification categories.

14. A document classification device which classifies
document clusters in accordance with contents thereof
comprising:

document input unit which inputs document data groups;

document dividing unit which divides document data into
one or multiple divided document data based on a predetermined
reference;

document-divided document map creation unit which
creates a map showing the correspondence between the document
data and the divided document data;

divided document classification unit which classifies
the divided document data;

divided document classification result creation unit
which creates divided document classification result
information based on a classification result of said divided
document classification unit; and

document classification result creation unit which
creates classification result information of the above document
data using the document-divided document map and the divided
document classification result information.



15. The document classification device according to claim 14,
further comprising document save unit which saves the document
data;

divided document save unit which saves the divided
document data; and

document-divided document map save unit which saves a
document-divided document map created by said document-divided
document map creation unit.

16. The document classification device according to claim 15,
further comprising divided document classification result save
unit which saves the divided document classification result
information created by said divided document classification
result creation unit.



17. The document classification device according to claim 14,
wherein a plurality of divided document data created by said
document dividing unit comprises the document data in its state
prior to being divided.

18. The document classification device according to claim 14,
wherein said document dividing unit divides document data based
on information relating to the structure of the document data.

19. The document classification device according to claim 14,
further comprising document element extraction unit which
extracts elements in the document data;

element-accompanying information extraction unit which
extracts element-accompanying information accompanying the
elements extracted by said document element extraction unit;

said document dividing unit divides the document data
using elements extracted by said document element extraction
unit, or the elements and element-accompanying information
extracted by said element-accompanying information extraction
unit.

20. The document classification device according to claim 14,
wherein said document dividing unit divides the document data
in compliance with a specified specification range.

21. The document classification device according to claim 14,
wherein said document dividing unit divides the document data
based on the number of letters, the number of sentences, or both
the number of letters and the number of sentences.

22. The document classification device according to claim 14,
wherein said document classification result creation unit
extracts and presents information showing document data, and
representative information accompanying the document data, as
classification result information.

23. The document classification device according to claim 22,
wherein said document classification result creation unit
extracts and presents information showing divided document data,
and representative information accompanying the divided
document data, as classification result information.

24. A document processing method which outputs a plurality
of input document data in order to display or print the document
data in a predetermined format, comprising the steps of:

storing input document data;

selecting all or part of the document data stored in the
storing step;

extracting data relating to characteristics of letter
rows from all or part of the document data selected in the
selection step;

work-processing all or part of the document data based
on the data relating to characteristics of letter rows extracted
in the characteristics extraction step; and

outputting all or part of the document data
work-processed in the work-processing step.



25. The document processing method according to claim 24,
wherein the step of outputting comprises the steps of setting
a plurality of item values based on the contents of all or part
of the document data work-processed in the work-processing
step; and totalizing all or part of the document data for each
item value set in the item value setting step; and

outputting all or part of the document data in the format
of a table having an item value as at least one axis.

26. The document processing method according to claim 24,
wherein the step of outputting comprises outputting all or part
of the document data work-processed in the work-processing step
together with all or part of the document data in its state prior
to work-processing in the work-processing step.

27. The document processing method according to claim 24,
wherein the step of storing further comprises storing all or
part of the document data work-processed in the work-processing
step.

28. The document processing method according to claim 24,
wherein the step of selecting further comprises selecting all
or part of the document data output in the output step.

29. The document processing method according to claim 24,
wherein the step of storing a document further comprises storing
data relating to contents of the work processing.

30. A document classification method which classifies
documents based on contents thereof comprising the steps of:

inputting document data;

language-analyzing document data input in the step of
inputting and obtaining language analysis information;

creating document characteristic vectors for the
document data based on the language analysis information
obtained in the step of language-analyzing;

classifying documents based on the degree of similarity
between document characteristic vectors created in the step of
creating vectors, and creating clusters of documents;

calculating cluster characteristics, being
characteristics of clusters of documents created in the step
of classifying; and

storing cluster characteristics, calculated in the step
of calculating cluster characteristics, as constituent
elements of classification categories.

31. A document classification method of classifying
documents based on contents thereof, comprising the steps of:

inputting document data;

language-analyzing document data input in the step of
inputting and obtaining language analysis information;

creating document characteristic vectors for the
document data based on the language analysis information
obtained in the step of language-analyzing;

classifying documents based on the degree of similarity
between document characteristic vectors created in the step of
creating vectors, and creating clusters of documents;

calculating cluster characteristics, which are
characteristics of clusters of documents created in the step
of classifying;

displaying the cluster characteristics calculated in the
step of calculating cluster characteristics;

selecting predetermined clusters from the clusters of
documents created in the step of classifying; and

storing cluster characteristics, calculated in the step
of calculating cluster characteristics, as constituent
elements of classification categories.

32. The document classification method according to claim 31,
further comprising the step of correcting document
characteristic vectors stored in the step of storing document
characteristic vectors, so that document characteristic
vectors of documents belonging to clusters selected by the step
of selecting clusters are deleted;

the step of classifying comprising classifying documents
based on the document characteristic vectors corrected by the
step of correcting vectors.



33. The document classification method according to claim 31,
further comprising a step of correcting document expression
space when determining the degree of similarity between
document characteristic vectors stored in the step of storing
document characteristic vectors, based on a characteristics
amount calculated from clusters selected in the step of
selecting clusters;

the step of classifying comprising classifying documents
based on the degree of similarity between document
characteristic vectors created in the step of creating vectors,
using the document expression space corrected in the step of
correcting the document expression space.

34. The document classification method according to claim 32,
further comprising a step of correcting document expression
space when determining the degree of similarity between
document characteristic vectors stored in the step of storing
document characteristic vectors, based on a characteristics
amount calculated from clusters selected in the step of
selecting clusters;

the step of classifying comprising classifying documents
based on the degree of similarity between document
characteristic vectors created in the step of creating vectors,
using the document expression space corrected in the step of
correcting the document expression space.

35. The document classification method according to claim 31,
further comprising the step of appending selection information
showing the fact of selection when all or part of the documents
belonging to a cluster of documents created in the step of
classifying have been selected;

the step of displaying comprising displaying the cluster
characteristics, and also displaying the selection information
appended in the step of appending selection information.

36. The document classification method according to claim 31,
wherein the step of creating classification categories
comprises creating cluster characteristics and/or information
created by an operator, in addition to all or part of the
documents belonging to a cluster of documents selected in the
step of specifying selection, as constituent elements of
classification categories.

37. A document classification method which classifies
document clusters in accordance with contents thereof
comprising the steps of:

inputting document data groups; dividing document data
into one or multiple divided document data based on a
predetermined reference; creating a map showing the
correspondence between the document data and the divided
document data; classifying the divided document data; creating
divided document classification result information based on the
classification result of classifying the divided documents; and
creating classification result information of the document data
using the document-divided document map and the divided
document classification result information.

38. A computer-readable recording medium in which are stored
programs for executing a document classification method, the
document classification method comprising the steps of:

storing input document data;

selecting all or part of the document data stored in the
storing step;

extracting data relating to characteristics of letter
rows from all or part of the document data selected in the
selection step;

work-processing all or part of the document data based
on the data relating to characteristics of letter rows extracted
in the characteristics extraction step; and

outputting all or part of the document data
work-processed in the work-processing step.



39. A computer-readable recording medium in which are stored
programs for executing a document classification method, the
document classification method comprising the steps of:

inputting document data;

language-analyzing document data input in the step of
inputting and obtaining language analysis information;

creating document characteristic vectors for the
document data based on the language analysis information
obtained in the step of language-analyzing;

classifying documents based on the degree of similarity
between document characteristic vectors created in the step of
creating vectors, and creating clusters of documents;

calculating cluster characteristics, being
characteristics of clusters of documents created in the step
of classifying; and

storing cluster characteristics, calculated in the step
of calculating cluster characteristics, as constituent
elements of classification categories.

40. A computer-readable recording medium in which are stored
programs for executing a document classification method, the
document classification method comprising the steps of:

inputting document data;

language-analyzing document data input in the step of
inputting and obtaining language analysis information;

creating document characteristic vectors for the
document data based on the language analysis information
obtained in the step of language-analyzing;

classifying documents based on the degree of similarity
between document characteristic vectors created in the step of
creating vectors, and creating clusters of documents;

calculating cluster characteristics, which are
characteristics of clusters of documents created in the step
of classifying;

displaying the cluster characteristics calculated in the
step of calculating cluster characteristics;

selecting predetermined clusters from the clusters of
documents created in the step of classifying; and

storing cluster characteristics, calculated in the step
of calculating cluster characteristics, as constituent
elements of classification categories.

41. A computer-readable recording medium in which are stored
programs for executing a document classification method, the
document classification method comprising the steps of:

inputting document data groups; dividing document data
into one or multiple divided document data based on a
predetermined reference; creating a map showing the
correspondence between the document data and the divided
document data; classifying the divided document data; creating
divided document classification result information based on the
classification result of classifying the divided documents; and
creating classification result information of the document data
using the document-divided document map and the divided
document classification result information.

ABSTRACT OF THE DISCLOSURE

The document processor is provided with a document memory
which stores input document data; a selector which selects all
or part of document data stored in the document memory; a
characteristics extractor which extracts data relating to
characteristics of letter rows from all or part of the document
data selected by the selector; a work processor which
work-processes all or part of the document data based on the
data relating to characteristics of letter rows extracted by
the characteristics extractor; and an output section which
outputs all or part of the document data work-processed by the
work processor.